Tuesday, December 8, 2009

Programming Recipes: Dataflow graph!


Yay! Now I can get down to one of the most useful of all recipes thus far! It's called a dataflow graph, or at least, that's what I call it.

Motivation

So, it's just another typical day at company XYZ, and you're writing some UI code. It's getting a bit hairy, because you have to keep track of updating things on the UI according to changes made to values...
procedure TfrmMain.seAmountChanged(Sender: TObject);
begin
  HandleAmountChanged;
end;

procedure TfrmMain.HandleAmountChanged;
begin
  UpdateSubtotals;
end;

function TfrmMain.GetSubTotal: extended;
begin
  Result := GetProductValue + GetOptionValue;
end;

procedure TfrmMain.UpdateSubTotals;
begin
  lblSubtotal1.Caption := Format('%.2m',[Subtotal]);
  lblSubtotal2.Caption := Format('%.2m',[Subtotal + ShippingTotal]);

  UpdateGrandTotal;
end;

procedure TfrmMain.HandleShippingFlagChanged;
begin
  UpdateGrandTotal;
  UpdateChart;
end;

Yeah, well, you get the idea. Your form is full of values that need to update as the user makes changes or whatnot. Man! All these HandleXXXChanged() and UpdateXXX() procedures are getting complicated! Plus, sometimes different changes require updating values in common, and a value gets updated more than once! That's inefficient! Now I have to add a ton of boolean fields to the form to avoid that, not to mention working out the coordination of the flags with procedure calls...oh, this could be a nightmare!

Or could it be?

There's so much common, repetitive structure to this system. Surely, we could automate something here. As a matter of fact, we can!

Brainstorming

So, what's the deal? Hmm... We have a set of values; each must auto-update when any changes to values that it depends on -- its dependancy values -- occur. Also, we don't want any of the auto-updated values to be computed multiple times, because that's inefficient.

So, we must traverse through a series of updates, preserving the updating order to always update the dependancy values before any of their dependant values are updated. This will allow us to only have to update each single value once.

How will we keep track of dependancies and dependants? We'll have to specify them, since we don't have code reflection for expressions, which would have allowed us to auto-derive the dependancies from expressions. Also, we will want to specify the "formula" for each value only once in the code -- NO duplication, nor any copy and paste. This will avoid inconsistencies and will reduce code size. The best place to specify dependancies is right next to the formula, for easy manual "bookeeping" and checking.

Discussion

So, basically, we have a collection of values that we must always keep consistent with a "formula" that defines them. When a dependancy changes, it's dependant values, to keep consistent with their formulas, will recompute, regarding their formulas as realtime-enforced constraints. So, it's like the new values -- data -- flows, from dependancies, to dependants, automatically. This is called the "dataflow paradigm" (see Wikipedia's article).

And because dependancies can be reused by multiple dependants, the dataflow network structure, rather than forming a tree that branches from dependants to dependancies, forms a graph -- a directed, acyclic graph (no circular references in any formulas). Thus we call this system a "dataflow graph" that we'll design and leverage.

Herein is the abstraction that dataflow graphs provide: Instead of it feeling like we're manually assigning the value to the formula, it feels like we specify the formula for a value only once in the code, and the value automatically keeps itself consistent with its formula, over time!

This should be what we want! A system for keeping track of updating for us!

Getting to Work

First, we start with our declarations:
unit uDG;

interface

type
  //index to a node:
  INode = integer;
...


Now, we want the formula for a value to be close to where we specify the dependancies for it, so we're going to use callbacks:
...
  TNodeAction = (naReadDependancies, naUpdate);

  //manages the implementation of a single node
  //in the dataflow graph:
  TNode = class;

  //our callback for nodes:
  TNodeHandler = procedure(ANode: TNode; AAction: TNodeAction) of object;

To construct our dataflow graph, we're going to have to describe the properties of each node:
...
  //description of a node:
  TNodeDesc = record
    ID: INode;
    EagerlyUpdated: boolean;
    Handler: TNodeHandler;
  end;

And now, the declaration for the TNode Class:
...
  TNode = class
  private
    //reference to its parent TNodes container:
    FList: TNodes;

    //The node's index-ID:
    FID: INode;

    //Maintained lists for all dependancy and dependant values
    //of a node (for efficiency's sake):
    FDependants,
    FDependancies: TINodeSet;

    //The callback:
    FHandler: TNodeHandler;

    //whether is ok to call the ReadDependancies() method of this object:
    FCanReadDependancies,

    //Whether the node, and its dependants
    //should update eagerly when invalidated:
    FEagerlyUpdated,

    //Whether the node is currently in need of being updated:
    FValid: boolean;

    //helper functions as follows...

    function AllDependanciesValid: boolean;

    procedure EnsureInvalidated;
    procedure EnsureDependantsInvalidated;

    procedure EnsureDependantsUpdated;
  public
    constructor Create(AList: TNodes; ADesc: TNodeDesc);
    destructor Destroy; override;

    { we call this function ONLY in the callback, ONLY when asked to get
      a node's dependancies; This procedure can only be called in a callback,
      when the callback is asked to provide dependancies;  if called out of
      context, a runtime error will be thrown (to prevent silent bugs!): }
    procedure ReadDependancies(ADependancies: array of INode);

    { we call the function below after assigning to a variable that has no
      dependancies; if it has dependancies, this function will be called
      automatically: }
    procedure HandleSet;
      { outside of transactions, it performs eager invalidation of dependants,
        followed by eager update of them.

        inside transactions: Performs eager invalidation of dependants,
        deferring updates to the end of the transaction (useful when assigning
        values to multiple independant nodes at once) }

    { the following function performs lazy update (when your nodes are not
      updated eagerly, call this function to update a dependant node, and it
      will automatically update any dependants that need to be, all the way up
      the chain; Warning: do not call inside a transaction! }
    procedure EnsureUpdated;
  end;

And now, the TNodes class -- the container for each TNode:
...
  TNodes = class
  private
    //used to allow recursive transactions:
    FTransactionCount: integer;

    //determines whether to call the callback to update nodes that
    //have no dependancies:
    FComputeIndependantNodes: boolean;

    //the list of nodes:
    FItems: array of TNode;

    //Helper functions...

    function GetItem(ID: INode): TNode;
    function GetNItems: integer;

    procedure Compute(ID: INode);
  public
    constructor Create(AComputeIndependantNodes: boolean;
      Nodes: array of TNodeDesc);

    destructor Destroy; override;

    { functions for doing transactions i.e. assignments to multiple
      independant nodes, so that we're not recomputing dependants for
      each single assignment: }
    procedure BeginTransaction;
    procedure EndTransaction;

    property Items[ID: INode]: TNode read GetItem;
    property NItems: integer read GetNItems;

    property ComputeIndependantNodes: boolean read FComputeIndependantNodes
      write FComputeIndependantNodes;
  end;

//This function is a DSL-like function we will use to specify the properties
//of a node, when we create the dataflow graph:
function Node(AID: INode; AHandler: TNodeHandler = nil;
  AEagerlyUpdated: boolean = True): TNodeDesc;

type
  //A helper class that allows us to use strings in place of integers,
  //for nodeIDs:
  TNodeNames = class
  private
    FNodeIDs: array of string;

    function GetNodeID(AID: string; var AINode: integer): boolean;
  public
    constructor Create;
    destructor Destroy; override;

    //call this function when referring to a string ID that is "new" i.e. when
    //we define the dataflow graph):
    function NewNodeID(AID: string): INode;

    { call this function to refer to an already-existing string ID.
      This function will throw a runtime error if the ID passed is not found
      (good for catching mis-spelled string IDs): }
    function NodeID(AID: string): INode;
  end;

So, there you have it -- the declarations! Now, let's peek under the hood and look at a couple details:
constructor TNode.Create(AList: TNodes; ADesc: TNodeDesc);
begin
  inherited Create;
  ...
  { in the TNode constructor, we call the handler to read its dependancies.
    No handler assumes no dependancies; the FCanReadDependancies flag allows
    us to enforce that no calls can be made to TNode.ReadDependancies(),
    except at this point: }
  if Assigned(FHandler) then
    begin
      FCanReadDependancies := True;
      FHandler(Self, naReadDependancies);
    end;
  FCanReadDependancies := False;
end;

Take a look at these next two routines:
procedure TNode.EnsureInvalidated;
begin
  //only invalidate if already valid
  if not FValid then Exit;

  FValid := False;
  EnsureDependantsInvalidated;
end;

procedure TNode.EnsureDependantsInvalidated;
var
  iDependant: integer;
begin
  for iDependant := Low(FDependants) 
   to High(FDependants) do
     FList.Items[FDependants[iDependant]].EnsureInvalidated;
end;

As we can see, invalidating a node will recursively traverse down to its dependants, and its dependant's dependants, etc.
//when a node has been set,
procedure TNode.HandleSet;
begin
  //it's now valid
  FValid := True;
  
  //but now all dependants need to be updated  
  EnsureDependantsInvalidated;

  { after invalidating all dependants, now we
    update all dependants, but only if we're not
    inside a transaction, which will do this for us! }
  if FList.FTransactionCount = 0 then
    EnsureDependantsUpdated;
end;

Notice the next routine:
procedure TNode.EnsureDependantsUpdated;
var
  iDependant: integer;
begin
  for iDependant := Low(FDependants) to High(FDependants) do
    if FList.Items[FDependants[iDependant]].FEagerlyUpdated then
        FList.Items[FDependants[iDependant]].EnsureUpdated;

  for iDependant := Low(FDependants) to High(FDependants) do
      if FList.Items[FDependants[iDependant]].FEagerlyUpdated then
          FList.Items[FDependants[iDependant]].EnsureDependantsUpdated;    
end;

So, why don't we just use the same recursion for updating dependants as we do for invalidating dependants? Because, in order to avoid updating dependants before ALL their dependancies are ready to use, we must perform a breadth-first traversal of nodes. The other way would perform a depth-first traversal, resulting in some node down the chain trying to update itself with dependancies that we haven't traversed to and invalidated or updated yet.

Ok, last but not least:
procedure TNode.EnsureUpdated;
var
  iDependancy: integer;
begin
  { We mustn't call this procedure within a transaction; the library is
    guaranteed to never call this procedure in a transaction, so this assertion
    will only go off if the programmer using the library makes a mistake: }
  Assert(FList.FTransactionCount = 0);
  
  //only update a node if it's not yet updated:
  if FValid then Exit;

  //update each dependancy node: notice that we do a depth-first, recursive
  //traversal down to all dependancy nodes, with this:
  for iDependancy := Low(FDependancies) to High(FDependancies) do
    FList.Items[FDependancies[iDependancy]].EnsureUpdated;

  //handle calling the callback, which performs the actual
  //calculation of the node:
  FList.Compute(FID);

  //and now we're valid:
  FValid := True;
end;

That's all the implementation I'm going through -- the main core. One of my methodologies is that the quickest way to learn how to use a library is to see examples, or use-cases, of it, so now let's see what it looks like to USE this library!

Example

Below is a small-scale example -- the full listing of an example form, with comments:
unit Unit1;

interface

uses
  //Standard Delphi 4 Units:
  Windows, Messages, SysUtils, Classes, Graphics, Controls, Forms, Dialogs,
  StdCtrls, Spin, Math,

  //Our dataflow graph library:
  uDG;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Label1: TLabel;
    Label2: TLabel;
    Label3: TLabel;
    SpinEdit1: TSpinEdit;

    procedure FormCreate(Sender: TObject);
    procedure FormDestroy(Sender: TObject);

    procedure FormMouseMove(Sender: TObject; Shift: TShiftState; X, Y: Integer);
    procedure SpinEdit1Change(Sender: TObject);
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }

    //our DG library's data structures:
    FNodes: TNodes;
    FNodeNames: TNodeNames;

    //values of the form:
    FFlag: boolean;
    FLastMousePos: TPoint;

    //callbacks for each non-independant node:
    procedure HandleNode_CanvasPenWidth(ANode: TNode; AAction: TNodeAction);
    procedure HandleNode_Label1Caption(ANode: TNode; AAction: TNodeAction);
    procedure HandleNode_Label1Position(ANode: TNode; AAction: TNodeAction);
    procedure HandleNode_Label2Caption(ANode: TNode; AAction: TNodeAction);
    procedure HandleNode_Label3Caption(ANode: TNode; AAction: TNodeAction);
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.DFM}

procedure TForm1.HandleNode_CanvasPenWidth(ANode: TNode; AAction: TNodeAction);
begin
  case AAction of

    { time to get the dependancies for this node; note that you can only call
      the following method in a callback, when AAction is assigned the given
      value here: }
    naReadDependancies:ANode.ReadDependancies([
      FNodeNames.NodeID('SpinEdit1.Value')
    ]);

    { time to compute the value of this node; notice how the value for
      Canvas.Pen.Width depends on the value for SpinEdit1.Value; this is why
      we have SpinEdit1.Value in the call to ReadDependancies() above; our
      dataflow graph will automatically update dependant values when their
      dependancies are flagged as changed; in this case, the following code
      will be run: }
    naUpdate:Canvas.Pen.Width := Max(SpinEdit1.Value, 1);
  end;
end;

procedure TForm1.HandleNode_Label1Caption(ANode: TNode; AAction: TNodeAction);
begin
  case AAction of
    naReadDependancies:ANode.ReadDependancies([
      FNodeNames.NodeID('SpinEdit1.Value')
    ]);
    naUpdate:Label1.Caption := IntToStr(SpinEdit1.Value);
  end;
end;

procedure TForm1.HandleNode_Label1Position(ANode: TNode; AAction: TNodeAction);
begin
  case AAction of
    naReadDependancies:ANode.ReadDependancies([
      FNodeNames.NodeID('FLastMousePos'),
      FNodeNames.NodeID('Canvas.Pen.Width')
    ]);
    naUpdate:begin
      Label1.Left := Round(0.2*FLastMousePos.x + 0.8*Label1.Left);
      Label1.Top := Round(0.2*FLastMousePos.y + 0.8*Label1.Top);

      Canvas.LineTo(Label1.Left, Label1.Top);
    end;
  end;
end;

procedure TForm1.HandleNode_Label2Caption(ANode: TNode; AAction: TNodeAction);
begin
  case AAction of
    naReadDependancies:ANode.ReadDependancies([
      FNodeNames.NodeID('SpinEdit1.Value'),
      FNodeNames.NodeID('FFlag')
    ]);
    naUpdate:Label2.Caption := IntToStr(SpinEdit1.Value+Ord(FFlag));
  end;
end;

procedure TForm1.HandleNode_Label3Caption(ANode: TNode; AAction: TNodeAction);
begin
  case AAction of
    naReadDependancies:ANode.ReadDependancies([
      FNodeNames.NodeID('FFlag')
    ]);
    naUpdate:
      if FFlag then
        Label3.Caption := 'True' else
          Label3.Caption := 'False';
  end;
end;

procedure TForm1.FormCreate(Sender: TObject);
begin
  //create the TNodeNames helper class
  FNodeNames := TNodeNames.Create;

  //use it here; notice I'm calling its NewNodeID() method,
  //to "declare" new node IDs:
  with FNodeNames do
    FNodes := TNodes.Create(True, [
      Node(NewNodeID('FLastMousePos')),
      Node(NewNodeID('Canvas.Pen.Width'), HandleNode_CanvasPenWidth),
      Node(NewNodeID('Label1.Caption'), HandleNode_Label1Caption),
      Node(NewNodeID('Label1.Position'), HandleNode_Label1Position),
      Node(NewNodeID('Label2.Caption'), HandleNode_Label2Caption),
      Node(NewNodeID('Label3.Caption'), HandleNode_Label3Caption),
      Node(NewNodeID('FFlag')),
      Node(NewNodeID('SpinEdit1.Value'))
    ]);

  //the transaction to initialize all values:
  FNodes.BeginTransaction;
  FLastMousePos.x := 30;
  FLastMousePos.y := 30;
  FFlag := False;
  FNodes.EndTransaction;
end;

procedure TForm1.FormDestroy(Sender: TObject);
begin
  FNodes.Free;
  FNodeNames.Free;
end;

procedure TForm1.FormMouseMove(Sender: TObject; Shift: TShiftState; 
  X, Y: Integer);
begin
  //assign an independant node's values:
  FLastMousePos.x := X+Random(5)-2;
  FLastMousePos.y := Y+Random(5)-2;

  //tell the dataflow graph that we just assigned to an INDEPENDANT node's
  //values; it will update the dependant values for us:
  FNodes.Items[FNodeNames.NodeID('FLastMousePos')].HandleSet;
end;

procedure TForm1.SpinEdit1Change(Sender: TObject);
begin
  FNodes.Items[FNodeNames.NodeID('SpinEdit1.Value')].HandleSet;
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  FFlag := not FFlag;
  FNodes.Items[FNodeNames.NodeID('FFlag')].HandleSet;
end;

end.

Conclusion

Well, that's today's programming recipe! Hope you've gotten something useful out of this! Below is the source code for both the library and the example:

Library Source
Example Source