Diff

(Engine-Level Function)

Description:	Compares two buffers and generates a third buffer containing formatted instructions describing how the first buffer can be modified so that it will match the second. This will perform a delimited difference unless the ChunkSize parameter is set to 1 or greater.
Returns:	Invalid (result returned in first parameter)
Usage:	Script Only.
Function Groups:	String and Buffer
Related to:
Threaded:	Yes
Format:	Diff(ResultBuffer, CompletionCounter, Buffer1, Buffer2, [Delimiter, ChunkSize, ClipLength, EdgeLength, MaxVariance, PointCap])
Parameters:

ResultBuffer

Required. Any expression that resolves to the variable to be set to the output buffer. This buffer is created asynchronously and should be checked for valid data before use.

The content of this buffer will be an instruction set for transforming the contents of Buffer1 into a duplicate of the contents of Buffer2. A detailed description of this instruction set is provided in the Comments section.

CompletionCounter

Required. Any expression that resolves to a variable containing a numeric value or Invalid.

If the variable holds a numeric value, it is incremented at the instant that Diff is called and decremented once ResultBuffer has been populated. The counter therefore returns to its starting value when all outstanding operations have finished, so the same variable can be used to monitor any number of simultaneous, asynchronous Diff operations.

If this parameter is set to Invalid, the Diff operation is performed synchronously. The function won't return until ResultBuffer is populated.
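
The counter behaviour described above can be modelled outside the scripting engine. The following is a minimal Python sketch of the pattern only, not of the engine's implementation: the names CompletionCounter (as a class), start_operation and the placeholder work functions are hypothetical, and ordinary threads stand in for the asynchronous Diff operations.

```python
import threading


class CompletionCounter:
    """Hypothetical stand-in for the numeric variable passed as CompletionCounter."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def add(self, delta: int) -> None:
        # Guard the update so concurrent operations cannot lose a change.
        with self._lock:
            self.value += delta


def start_operation(counter, results, key, work):
    """Start one asynchronous operation that reports through the shared counter."""
    counter.add(1)  # incremented at the instant of the call

    def run():
        results[key] = work()  # populate the result first...
        counter.add(-1)        # ...then decrement the counter

    thread = threading.Thread(target=run)
    thread.start()
    return thread


counter = CompletionCounter()
results = {}
threads = [
    start_operation(counter, results, "a", lambda: b"first"),
    start_operation(counter, results, "b", lambda: b"second"),
]
for t in threads:
    t.join()

# Back at its starting value: every operation has populated its result.
print(counter.value, sorted(results))
```

Because each result is stored before the counter is decremented, a counter that has returned to its starting value implies that every result is available.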

Buffer1

Required. Any expression that returns the first buffer. This is the buffer that is intended to be modified by the instructions returned.

Buffer2

Required. Any expression that returns the second buffer. This is the buffer that the first buffer would match were the returned instructions applied.

Delimiter

Optional. The bytes used to delimit lines in text buffers (or records in any sort of delimited buffer). Accepts either a single text string or, to specify multiple delimiters, an array of text strings.

If not otherwise specified, the default is an array containing typical text file line endings (newline, carriage return or a combination of the two characters in either order).

ChunkSize

Optional. The number of bytes to compare as a unit in a binary buffer. Must be set to 1 or greater to enable a binary diff (a delimited diff is performed by default).

Unless the contents of the buffers are guaranteed to align to a given number of bytes it is recommended that this be set to 1 to enable binary diffs. Defaults to 0.
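
The alignment recommendation can be illustrated with a small Python sketch. It is an illustration only: Python's difflib.SequenceMatcher stands in for the engine's search, and the helper names chunks and matched_bytes are hypothetical. It shows how a single inserted byte shifts every later chunk when the chunk size is 4, so that almost nothing matches, while a chunk size of 1 still matches every original byte.

```python
from difflib import SequenceMatcher


def chunks(buf: bytes, size: int) -> list:
    """Split a buffer into comparison units of `size` bytes (last one may be short)."""
    return [buf[i:i + size] for i in range(0, len(buf), size)]


def matched_bytes(a: bytes, b: bytes, size: int) -> int:
    """Count the bytes of `a` that fall inside chunks matched against `b`."""
    ca, cb = chunks(a, size), chunks(b, size)
    matcher = SequenceMatcher(None, ca, cb, autojunk=False)
    return sum(
        len(ca[block.a + k])
        for block in matcher.get_matching_blocks()
        for k in range(block.size)
    )


old = b"ABCDEFGHIJKLMNOP"
new = b"ABCDEFXGHIJKLMNOP"  # one byte ("X") inserted after "F"

print(matched_bytes(old, new, 1))  # chunk size 1: all 16 original bytes match
print(matched_bytes(old, new, 4))  # chunk size 4: only the first chunk (4 bytes) matches
```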

ClipLength

Optional. This numeric value is an optimization. It indicates how long a run of matches (i.e. both buffers having identical contents) must become before the function decides that it has found an optimal instruction set and discards competing sets.
If Diff returns sub-optimal instructions, you should increase this value. Lower values reduce the execution time of the function at the cost of output quality. Higher values increase output quality but decrease speed.
Sub-optimal instruction sets will result if runs of matches of the given length can occur randomly within the two buffers. Defaults to 20.

EdgeLength

Optional. Another numeric optimization, best set to twice the ClipLength. Causes the elimination of instruction sets that are estimated to require at least EdgeLength more instructions than the best set at any point during the search.
Sub-optimal instruction sets will result if the estimate is inaccurate by an amount greater than this value.
Lower EdgeLength values will reduce the execution time of the function at the cost of the quality of the output. Higher values will increase output quality but decrease speed. Defaults to 40.

MaxVariance

Optional. Sets a maximum variance, measured as the number of items changed in the same way. If the diff strays from an exact match by MaxVariance data adds or deletes, execution will stop.

A mixture of adds and deletes will cancel each other out. When set to a value smaller than the default, files with lots of small modified areas will pass, while files with a single modification larger than this variance will fail.

Defaults to 1,000,000.

PointCap

Optional. Sets a cap on the number of points that will be searched within the buffers. In effect, this value serves to cause a timeout when comparing extremely large buffers that are almost completely different. Defaults to 1,000,000,000.

Comments:

ResultBuffer will be set to Invalid on failure. Otherwise, it will contain a buffer of zero or more binary records. Each record will consist of at least two 32-bit words, containing instructions in the following form:

The highest bit of the first word indicates whether this is a delete instruction or an add instruction. 0 means "delete" while 1 means "add". The remaining 31 bits of the first word (taken as a 31-bit unsigned integer) contain the number of bytes to be affected by this operation.

The second word, taken as a 32-bit unsigned integer, indicates the offset of the operation. That is the location of the bytes affected.

If the operation is to add bytes, there will be a binary string following the second word. These are the bytes to be added at the specified location.

Because the Diff function uses a searching algorithm, and in particular an incomplete search (that is, one that tries to find a solution without exploring all of the possibilities), it will at any time have only a partial collection of all the possible solutions to the problem. Each solution is defined as a set of instructions that modify the source buffer, and each of these sets requires a different number of instructions to convert a different region of that buffer. The "best set" is the one that converts the largest portion of the buffer while requiring the fewest changes to it, selected from the solutions that have been discovered so far.

The optimization works by eliminating solutions which appear to be so much worse than the current best set that they are unlikely to recover, as judged by how many more changes they require to convert a similar region. The problem is that a solution which works poorly in one region may perform much better in the others, so the optimization may cause the "real" best set (the one that's optimal for the entire buffer) to be overlooked.

ResultBuffer will be an empty buffer if Buffer1 and Buffer2 are identical.
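
The record layout described in these comments can be decoded as sketched below. This is a minimal Python sketch for a tool that reads the result buffer outside the scripting engine, not a definitive implementation. It rests on assumptions that the text above does not confirm: the words are taken to be little-endian, the 31-bit count is taken to be a byte count, and the data following an add instruction is taken to be exactly that many bytes long. The names DiffRecord and parse_diff_records are hypothetical.

```python
import struct
from typing import List, NamedTuple


class DiffRecord(NamedTuple):
    op: str      # "add" or "delete"
    count: int   # low 31 bits of the first word
    offset: int  # second word: location of the affected bytes
    data: bytes  # bytes to add (empty for a delete)


def parse_diff_records(buf: bytes, byteorder: str = "<") -> List[DiffRecord]:
    """Decode a buffer of records; byte order is an assumption (default little-endian)."""
    records = []
    pos = 0
    while pos < len(buf):
        if len(buf) - pos < 8:
            raise ValueError("truncated record header")
        first, offset = struct.unpack_from(byteorder + "II", buf, pos)
        pos += 8
        is_add = bool(first >> 31)   # highest bit: 1 = add, 0 = delete
        count = first & 0x7FFFFFFF   # remaining 31 bits
        data = b""
        if is_add:
            # Assumption: the payload length equals the count field.
            if len(buf) - pos < count:
                raise ValueError("truncated add payload")
            data = buf[pos:pos + count]
            pos += count
        records.append(DiffRecord("add" if is_add else "delete", count, offset, data))
    return records


# Hand-built sample: delete 3 bytes at offset 10, then add b"hi" at offset 10.
sample = struct.pack("<II", 3, 10) + struct.pack("<II", 0x80000000 | 2, 10) + b"hi"
for record in parse_diff_records(sample):
    print(record)
```

An empty buffer decodes to an empty list, which corresponds to the case where the two buffers are identical.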

Example:

The following generates a set of undo instructions required to change a "new" file back to an "old" file.

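... If 1 NextState;
  [
  ...
    { Open both versions of the file as streams }
    NewFileStream = FileStream(NewFileVersion);
    OldFileStream = FileStream(OldFileVersion);
    { Invalid as the completion counter makes the call synchronous.
      DiffForUndo receives the instructions that change new back to old }
    Diff(DiffForUndo, Invalid, NewFileStream, OldFileStream);
  ...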