Progress summary I

One year has passed since Phase 3 ended. I have been working all this time on a 3D engine prototype that will eventually help to design the actual 3D engine to be published along with the rest of the libraries of Quimera Engine. It is not finished yet, but I wanted to stop for a moment and look back to get an idea of how far I have advanced. This post is just a summary of all the features implemented so far, although a list of pending features is included at the end too.

Currently, the engine is based upon OpenGL 4.5 (DSA functions, GLSL 4.5, SSBOs, #includes in shaders, std430 layout, program pipelines, glBlitNamedFramebuffer, glClearTexImage, etc.), although every interface, enumeration and data structure has been designed in such a way that switching to DX11 is quite straightforward. The approach used to design the main components is a mixture of OOP and DOD, depending on the case; let's say that data is organized as SOAs in memory but is accessed indirectly, through class methods. As a convention, for now, it is left-handed, row-major and the 2D origin is at the bottom-left corner. The intention is to use as few different shaders as possible, so ideally there would be only one shader to cover all the possibilities (in practice, this does not happen), although it is possible to change the current shader at any moment as long as the input data structure is the same.

The main features are presented below, in the order they were implemented:

Loading 3D models and textures with ASSIMP and STB

As a future step, the engine will use its own format and ASSIMP will be moved to a custom conversion tool. STB is used for loading uncompressed RGB and RGBA images.

 

Octrees + OBB + Frustum

Currently, the octree is used to filter the entities whose bounding box is inside the current camera's frustum before rendering. The scene octree is updated every frame, so, depending on the transformations of the dynamic entities' bounding boxes, only the cells that contain an entity are enabled, as can be seen in the image.

 

Loading DDS compressed textures

The parser supports BC1, BC2, BC3, BC4 and BC5 compression formats. It will support BC6H and BC7 in the future.

 

Deferred Rendering

The G-Buffer contains 7 textures: Positions, Normals, Albedo color, Specular color (rgb) + specular diffusion (a), Tangents (xyz) + specular power (w), Bitangents and Emissive color. Blinn-Phong lighting is used.

 

Normals, tangents and bitangents visualization

 

Self-shadowed parallax occlusion mapping (deferred rendering)

These 2 images show a quad (2 triangles) using a normal map with parallax occlusion enabled.

 

DXT5 normal maps supported

In the image, normal mapping is enabled on the left and disabled on the right.

 

2.5D engine and Text rendering using FreeFont and a font atlas. Dead Reckoning signed distance fields

2D entities are shown as 3D quads generated in the geometry shader, facing an orthographic camera. 2D entities scale with the window, optionally. The SDF technique can be applied to any monochrome image (opacity map), as can be seen in the last image of this section. Available effects are: parallel shadows (color, distance, blurring), glowing (color, softness, radius), antialiasing (amount) and outlining (width, color, softness). It is also possible to use any texture or combination of textures as inner color. Text can be wrapped into a rectangle (different modes) and transformed as any other 2D entity.

 

Transparency using the Weighted Blended OIT algorithm (no back-to-front polygon ordering)

This technique requires forward rendering, at the moment.

 

Others

Multi-texturing: Up to 8 textures can be stacked and blended for each component (diffuse maps, specular maps, normal maps, etc.).

Reoriented normal mapping: This is the technique used to blend normal maps.

Gamma correction

SMDI textures supported

Keyboard and mouse inputs (some advanced features)

Window management (resizing, fullscreen)

 

TO-DO

Multiple lights

Texture transformations

Cube maps (currently they can be loaded but not shown)

Picking

Infinite grid

Keyframe animations

Bone animations

Instancing

Shadows

HDR

Bloom

Ambient occlusion

Fresnel lighting + BRDF

Clustered lights

Blurring (postprocess)

Depth of field

Level of detail

Own binary format for assets and converter

Assets hot-loading

Light texture projection

LUA interface

Particle systems with OpenCL

 

The list is as long as it is exciting. It will surely take at least another year, since I can only work on this in my spare time, but I am not in a hurry anyway.

 

Migrating to Github

As most developers may already know, Google decided some time ago to get rid of another of their products, Google Code in this case. Moving a Subversion repository somewhere else is not complex; the problem is that it is not just about source code: the wiki, the task manager and the contributor list will also disappear this year. For Quimera Engine, the only thing that would mean a notable loss of time is moving the more than 800 tasks that exist in the Issue Tracker, counting both open and closed. Fortunately, Google provided (which is the least they could do) every administrator with a script to make the migration easier to one of the many project management platforms with a code repository out there: Github. I do not know whether the script was made by Github's people or by Google Code's.

All the problems and necessary changes that had to be done for this project are described below:

  • The directory structure of the repository had to be changed. The migration script only exports a trunk folder and a branches folder from the repository's root directory. The folder tree in this project was the typical one found in an SVN repository: trunk, team, tags, branches. Therefore, for the user to find exactly the same structure when entering Github (which shows trunk's content only), everything in the root folder was moved into the trunk folder (so there is one trunk inside another).
  • All the commits related to the folders outside the trunk folder were lost. Although the folders were moved as described in the previous point, all the commits associated with the team folder, for example, are lost: all the valuable information about the progress, dates, comments describing solutions, etc.
  • Empty .txt files had to be created inside every empty folder in the repository. Git works with files, not directories, so an empty directory does not exist for it and hence is not kept in the repository. Some of the SQDirectory and SQFile unit tests required the existence of some empty directories, so they have to be created by code now; besides, there were other directories which served to highlight that certain things are planned but not done yet. There are some hacks to simulate empty folders in Git, but it was not necessary.
  • Status labels had to be created and assigned to every issue. This information is lost during the export; only the open and closed states are distinguished. For example, an issue whose status was "Ready For Review" (open, ready to be reviewed by someone) would not show that status in the corresponding Github issue.
  • Although it was not necessary, instead of using labels to know which phase an issue belongs to, they have been removed and replaced with milestones.
  • Obviously, both the web portal and the wiki of the project were updated so that all the links point to the Github page.

As can be seen, it was not very hard, but it was tedious and consumed quite a lot of time. The next paragraphs are not objective anymore and include personal opinions about the new situation.

After fiddling a lot with Github (which I had heard a lot about but never tried), I began to ask myself many questions and to feel overwhelmed, mostly after I saw the new Issue Tracker. The following lists show the pros and cons of Github's Issue Tracker compared to Google Code's (note that only features used by this project are considered; there may be better or worse things that do not affect this project):

Pros:

  • Color labels are more visible, as long as there are not too many.
  • The issue's description can be edited.
  • Managing and viewing repository branches is easier.
  • Prettier?
  • Filtering is easier.

Cons:

  • When a list of labels is shown, either when assigning them or when filtering tasks, the order in which labels appear cannot be controlled. For example, priority labels cannot be put together; a common prefix has to be added.
  • Issues cannot be removed.
  • The order in which labels appear in the list of issues cannot be modified.
  • There are very few options to sort the issue list (since there are neither columns nor label groups, it is not possible to sort by Status, for example).
  • Filtering by the active user using any kind of wildcard is not possible. If I want to see my tasks, I have to select my user in the filter.
  • There is no "blocked task" concept, hence no relation is possible between blocked tasks and blocking tasks.
  • Only 25 tasks can be seen in the list at a time.

 

After discovering all this, my conclusion was clear: making it look pretty was more important than making it practical. The design of the site, in general, is infinitely better than Google Code's, more modern and accessible, that is undeniable. But the list of issues is horrible, one of the worst I have seen from the point of view of a person who has to manage more than 10 issues a day, related to each other, with several labels per issue. Honestly, I expected much more from a tool that hosts such an amount of projects and users. Only 25 issues per page? Really? Not to mention that only 15 fit on my 24" screen. And as I remarked before, when there are more than 2 labels per issue, you get dizzy just by looking at it; it is a color party. Call me weird or old-fashioned, but I prefer a table in which I can quickly see 100 issues, their priorities, their types, their status, etc.

I do not like just complaining, so I began to think about solutions for this kind of inconvenience. In order to obtain more than 25 rows per page, the only alternative (as far as I know) is to call Github's API, which only makes sense if you create a client to show the results; although it is something simple, I unfortunately do not have time for that. However, I searched for a web browser plug-in that would let me override the CSS of any web page I visited. Indeed, there was a quite good one that worked on both Firefox and Chrome, Stylish, whose official site contains a lot of CSS created by users for thousands of different web sites, including Github. However, after looking at the available styles I realized that was not what I wanted (in case anybody is curious, there are styles for making the background black, which is interesting if you code in the dark). So I wrote my own CSS and could not be more satisfied with the result:

Before and after


The style is adjusted to resolutions that are equal to or greater than 1680x1050.

Obviously there will be people who do not like it, but from my point of view it is much more useful this way. All the issues of the page are visible, keeping all the original information, without mixing the title text and the labels; labels of the same kind even appear one under another, which makes it easier to distinguish label values.

To sum up, I understand that everybody has their project on Github due to the visibility it provides, because it was one of the first to host Git repositories, or even due to the features it offers for looking for a job and interacting with others; but really, as a tool for a medium-large project that requires somewhat complex management, it is not practical at all.

Finally, I would like to ask the reader to tell me if I am wrong and why, and if anybody knows about a free Github client, be it online or local, please tell me.

Graphics device state change optimization

The purpose of this post is to describe an algorithm prototype intended to reduce the number of state changes to be submitted to the graphics device when rendering a group of meshes whose visual aspect (a set of graphics device states and shader input data) varies. Submitting state changes and shader input data is expensive, and this is a typical problem every engine tries to solve.

Any correction, be it grammatical or technical, is welcome. If you have any idea to improve the algorithm, feel free to share it in the comments below.

The code and algorithms discussed in this post have not been tested yet; they have been elaborated from scratch and almost everything is conjecture, although with an actual technical basis. It is similar to thinking aloud.

 

Inputs

The algorithm receives 2 sets of data: mesh subsets and their corresponding visual aspect Id. A mesh subset is just a vertex buffer that will be rendered using a single visual aspect. A visual aspect contains all the information required to render a mesh subset in an expected way, which includes data to be sent to the shaders pipeline and graphics device states to be changed. A visual aspect structure can be seen below:

Visual Aspect

  • Visual Aspect Component [ #COMPONENTS ]
    • Texture Layer [ #LAYERS ]
      • Texture Id
      • Sampler Id
      • Texture Blender
        • Blend Factor
        • Blend Operation
    • Texture Stack Size
    • Component Enabled
  • Material Id
  • Shaders Pipeline Id
  • Alpha Blender Id
  • Opacity
  • Wireframe Enabled
  • Face Culling Enabled

 

#COMPONENTS is the number of visual aspect components, which are: Ambient, Diffuse, Specular, Normals, Displacement, Height, Emission, Shininess, Reflection, Lightmap, Opacity; hence, the value of #COMPONENTS is 11.

#LAYERS is the number of texture layers in the stack, currently 8.

The total number of attributes of a visual aspect is 11 x (8 x 4 + 2) + 6 = 380.
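For illustration, a minimal C++ sketch of how such a structure could be laid out is shown below; all the names, types and container sizes are assumptions made for this post, not the engine's actual code:

#include <cstdint>

// Sketch of the visual aspect structure described above (assumed names/types).
const unsigned int COMPONENTS = 11; // Ambient, Diffuse, Specular, Normals, etc.
const unsigned int LAYERS     = 8;  // texture layers per component

struct TextureBlender
{
    float        BlendFactor;
    std::uint8_t BlendOperation; // would be an enumeration in a real implementation
};

struct TextureLayer
{
    std::uint32_t  TextureId;
    std::uint32_t  SamplerId;
    TextureBlender Blender;
};

struct VisualAspectComponent
{
    TextureLayer Layers[LAYERS];
    std::uint8_t TextureStackSize;
    bool         ComponentEnabled;
};

struct VisualAspect
{
    VisualAspectComponent Components[COMPONENTS];
    std::uint32_t         MaterialId;
    std::uint32_t         ShadersPipelineId;
    std::uint32_t         AlphaBlenderId;
    float                 Opacity;
    bool                  WireframeEnabled;
    bool                  FaceCullingEnabled;
};

// Attribute count: 11 x (8 x 4 + 2) + 6 = 380, matching the figure above.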

The following table shows some additional information about every attribute. The cost is calculated taking into account the number of state change functions to call, the amount of data to copy and, in the case of Shaders Pipeline Id, the consequences of changing it (depending on the implementation of the engine, it may require always sending all the attributes whose Operation in the table is Send to shader). The probability of change is just an estimation based on personal recollections and may change in the future; the higher the value, the more often such an attribute is expected to differ among the existing visual aspects.

Attribute              Operation          Cost   Probability of change
Texture Id             Change state       1      6*
Sampler Id             Change state       1      3*
Blend Factor           Send to shader     1      3*
Blend Operation        Send to shader     1      3*
Texture Stack Size     Send to shader     2      5**
Component Enabled      Send to shader     3      5**
Material Id            Send to shader     2      5
Shaders Pipeline Id    Shaders Pipeline   3      3
Alpha Blender Id       Change state       2      2
Opacity                Send to shader     1      4
Wireframe Enabled      Change state       1      1
Face Culling Enabled   Change state       1      1

*This value must be multiplied by both modifiers described below, which depend on the texture layer and the visual aspect component.

** This value must be multiplied only by the visual aspect component modifier, described below.

The probability of change also depends on which visual aspect component and which texture layer we are evaluating. The most used texture layer is, of course, the first layer; the most used visual aspect component is, without a doubt, the Diffuse component. These are the probability modifiers of each one:

Texture Layer   Probability of change modifier
0               8
1               7
2               6
3               5
4               4
5               3
6               2
7               1

 

Visual Aspect Component   Probability of change modifier
Diffuse                   11
Specular                  10
Normals                   9
Displacement              8
Shininess                 7
Reflection                6
Opacity                   5
Height                    4
Lightmap                  3
Emissive                  2
Ambient                   1

 

The Algorithm

The main idea is to put similar visual aspects together so the number of states to change when switching from one visual aspect to the next is the minimum possible. The steps of the algorithm are:

1-   Grouping mesh subsets by common visual aspect and ordering them by equality.

2-   Comparing all the visual aspects to each other and storing the differences.

Proposal A:

3-   Ordering the comparison results by similarity, probability of change and cost.

4-   According to the ordered comparison results, extracting the list of visual aspects, putting the ones that have fewer differences contiguously.

Proposal B: Traveling Salesman Problem

3-   Choosing the visual aspect with the most expensive edges.

4-   Generating a minimum spanning tree.

5-   Traversing the tree in a pre-order walk.

This algorithm is valid only in case the number of different visual aspects is greater than 2.

Mesh subsets grouped by every visual aspect should be rendered in the order the visual aspects appear in the result list.

 

1-    Grouping mesh subsets by common visual aspect

As an input to this stage, we have a list of mesh subsets (which belong to different meshes associated with different scene entities). Every mesh subset has one and only one associated visual aspect, and every visual aspect may be associated with one or many mesh subsets.

The output will be one list of mesh subsets per visual aspect.

Mesh subsets are traversed one by one and added to one of the lists, the one that corresponds to its visual aspect. If such list does not exist yet (because no previous mesh subset had that visual aspect), it is created and the mesh subset is added as the first element.

 

After all mesh subsets have been grouped, they must be ordered in such a way that all those subsets that are equal (same index in same mesh) inside the same group end up contiguous. This way, when vertices are rendered, it is not necessary to set up the same subset more than once.
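As a rough sketch of this grouping and ordering step (the container choices, names and id types are assumptions, not the engine's code):

#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct MeshSubsetRef
{
    std::uint32_t MeshId;
    std::uint32_t SubsetIndex;
    std::uint32_t VisualAspectId;
};

// Groups mesh subsets by their visual aspect id: one list per visual aspect.
std::unordered_map<std::uint32_t, std::vector<MeshSubsetRef>>
GroupByVisualAspect(const std::vector<MeshSubsetRef>& subsets)
{
    std::unordered_map<std::uint32_t, std::vector<MeshSubsetRef>> groups;

    for (const MeshSubsetRef& subset : subsets)
        groups[subset.VisualAspectId].push_back(subset); // the list is created on first use

    // Inside every group, put equal subsets (same index in the same mesh) together,
    // so every subset has to be set up only once when rendering.
    for (auto& group : groups)
        std::sort(group.second.begin(), group.second.end(),
                  [](const MeshSubsetRef& a, const MeshSubsetRef& b)
                  {
                      return a.MeshId != b.MeshId ? a.MeshId < b.MeshId
                                                  : a.SubsetIndex < b.SubsetIndex;
                  });

    return groups;
}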

 

2-    Comparing all the visual aspects to each other and storing the differences

At the end of this stage, we should know how different every visual aspect is with respect to the rest.

As we have seen, the number of attributes of a visual aspect, that is, the number of comparisons we have to perform per pair of visual aspects, is 380 at most; note that if the Component Enabled attribute is False, there is no need to compare the rest of the attributes of that visual aspect component. The same kind of omission occurs depending on the value of Texture Stack Size.

There are many ways to store the result of all the comparisons but we will use a bit set; i.e. every result will be represented with 1 bit:

  • 0 means attributes are different
  • 1 means attributes are equal

 

Therefore, we need 380 bits per comparison. Modern CPUs support operating with 128-bit and 256-bit operands (moreover, at the time of writing, it is planned to add 512-bit instructions too in the second generation of the Intel Xeon Phi processor family, named Knights Landing), so we could use either 2x256-bit integers (512 bits in total, wasting 512 - 380 = 132 bits) or 3x128-bit integers (384 bits in total, wasting 384 - 380 = 4 bits) to store all the comparison results. Using 128-bit types would let us pack arrays more tightly. However, the number of visual aspect attributes is expected to increase, so leaving just 4 free bits is maybe not a good idea; anyway, if we add more than 4 attributes we will need 512 bits (4x128 or 2x256), and it is faster to execute the same instruction twice than four times.

Now we have to designate the position of every attribute in the bit set. The criterion to follow is that the most significant bit (MSB) corresponds to the attribute with the highest cost of change and the greatest probability of change. Note that this MSB is not the actual MSB of the 512-bit bit set but bit #380 (bit #1 being the least significant bit (LSB)). This table shows the order of all the attributes in the bit set, using a relative order (POC = Probability of change). Although it is not perfect, that table is a good enough estimation of which position every attribute should occupy in the bit set. The Final POC column is calculated by multiplying the Probability of change by the corresponding modifiers for the visual aspect component and the texture layer to which the attribute belongs (depending on the attribute, either modifier may not be applicable, as shown in the tables of the Inputs section). The Final order is calculated by multiplying the Final POC column by the Cost column.

A small adjustment has been made to get a more reasonable result, since multiplying by the mentioned modifiers has a side effect: those attributes that are not multiplied end up being one or two orders of magnitude lower than those that are. For example, Component Enabled is multiplied only by the Visual Aspect Component modifier (which can be up to 11), whereas Texture Id is also multiplied by the Texture Layer modifier (which can be up to 8); therefore, the probability of change of the first attribute will always be lower than the second's by one order of magnitude. This makes no sense, as we will see later; the Component Enabled bit must be checked before the rest of the attributes of the component. To fix this, we just increase some attributes by either one or two orders of magnitude, multiplying by 10 or 100 respectively:

Attribute              Adjustment modifier
Texture Stack Size     10
Component Enabled      10
Material Id            100
Shaders Pipeline Id    100
Alpha Blender Id       100
Opacity                100
Wireframe Enabled      100
Face Culling Enabled   100

 

Comparison results (blocks of 512 bits) will be stored in a 32-byte aligned array. At the same time, we need to store another array that contains pairs of visual aspect Ids, which correspond to every comparison result, and one last array to store the amount of equal attributes per comparison.

In order to know which visual aspects are most similar to each other, we need to compare all of them against all the others. We can know in advance how many comparisons we will perform by using a simple combinatorial formula: the number of combinations of a set of m elements taken n at a time, in any order and without repetition:

C(m, n) = m! / (n! * (m - n)!)

In our case, m is the number of visual aspects to compare (#VAspects below) and n equals 2, since we are going to make pairs.

Calculating factorials can be both expensive in terms of CPU usage and inaccurate due to integer overflows, which will occur for sure if the rendered scene includes just 100 different visual aspects; take into account that 13! is already greater than what can be stored in a 32-bit unsigned integer. In order to avoid these problems, an equivalent formula can be used instead:

C(m, 2) = m * (m - 1) / 2

If the simplification steps are not obvious to you, just think about a very simple case. For example, let m be 5:

C(5, 2) = 5! / (2! * 3!) = (5 * 4 * 3!) / (2 * 3!) = (5 * 4) / 2 = 10

Now the calculation is reduced to 1 division, 1 subtraction and 1 multiplication. It is done just once per frame, so it is not really a big performance improvement.

Just to get an idea of what numbers we are talking about: when using 100 visual aspects we have to perform exactly 4,950 visual aspect comparisons. The worst case would mean that we have to do 4950 x 380 = 1,881,000 attribute comparisons. Obviously this is not likely to happen, since most of the visual aspects are expected to consist of 1 to 3 textures combined using 1 to 5 components (diffuse, specular, displacement, normal map and shininess, maybe). We can also know the minimum number of comparisons by counting non-conditional attributes (including Component Enabled): 7 x 4950 = 34,650 attribute comparisons.

An additional advantage of knowing the number of comparisons is that we can reserve exactly the amount of memory we need for the three arrays instead of reallocating buffers as they grow. The array of comparison results when using 100 visual aspects would occupy 512 x 4950 = 2,534,400 bits (316.8 KB), which is not a big number considering the amount of memory required by applications nowadays; the array of visual aspect ids would require 64 x 4950 = 316,800 bits (39.6 KB), supposing we use pairs of 32-bit unsigned integers; and lastly the array of equal attribute counters, for which 16-bit unsigned integers could be used, would need 16 x 4950 = 79,200 bits (9.9 KB). Total: 366.3 KB.

Optimization note: Reserved memory could be kept and recycled in the next frame as long as the memory block required for the next frame is not bigger. Pool allocators might be useful for this purpose. In the end, there will be no need to reserve more memory, as the maximum will be reached eventually.
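A small sketch of the pre-calculation and the reservation of the three arrays, under the sizes assumed in this post (the extra rows described later for the last visual aspect's group are omitted here; the type names are assumptions):

#include <cstdint>
#include <vector>

struct ComparisonResult512 // 512-bit bit set stored as 8 x 64-bit words (2 x 256 bits)
{
    alignas(32) std::uint64_t Bits[8];
};

struct VisualAspectPair
{
    std::uint32_t FirstId;
    std::uint32_t SecondId;
};

void ReserveComparisonArrays(std::uint32_t visualAspectCount,
                             std::vector<VisualAspectPair>&    pairs,
                             std::vector<ComparisonResult512>& results,
                             std::vector<std::uint16_t>&       equalCounters)
{
    // Number of unordered pairs: m * (m - 1) / 2
    const std::uint32_t comparisonCount = visualAspectCount * (visualAspectCount - 1u) / 2u;

    pairs.reserve(comparisonCount);
    results.reserve(comparisonCount);
    equalCounters.reserve(comparisonCount);
}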

As said before, visual aspect attributes are compared one by one and, every time they are equal, a 1 is written at the corresponding position in the bit set; otherwise a 0 is written. This is performed by using bit masks and the AVX2 256-bit binary OR instruction called VPOR, which is wrapped by the intrinsic function _mm256_or_si256. From Intel’s documentation:

VPOR ymm1, ymm2, ymm3/m256

Bitwise OR of ymm2/m256 and ymm3

 

extern __m256i _mm256_or_si256(__m256i s1, __m256i s2);

Performs a bitwise logical OR operation of the signed integer elements of source vector s2 and the corresponding elements in source vector s1 and stores the result in the destination vector. If either of the corresponding bits of the first and second vectors are 1, each bit of the result is set to 1, otherwise it is set to 0.

 

REMINDER: Do not forget to activate AVX2 compiler extensions with either /arch:AVX2 (VC++) or -mavx2 (GCC). You need to compile for x64 since AVX and AVX2 are only available on 64-bit processors.

We are forced to use signed integers, since there is no specific intrinsic function for vectorized unsigned integers. In a later step we will need to compare bit sets as signed integers, which could be problematic if we set the MSB to 1, because that would mean the integer is negative; any comparison between that value and any other would produce an incorrect result, since we need unsigned comparisons. Fortunately, we do not need to worry about this as long as we do not use the MSB.
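As a sketch of how a single comparison bit could be written into the 512-bit result using _mm256_or_si256 (the helper name and layout are assumptions; in a real implementation the masks would probably be precomputed constants):

#include <immintrin.h>
#include <cstdint>

// Sets bit 'position' (bit #1 = LSB) of a 512-bit comparison result stored
// as two 256-bit registers, using a bit mask and the VPOR instruction.
inline void SetComparisonBit(__m256i result[2], unsigned int position)
{
    const unsigned int zeroBased = position - 1u;       // bit #1 is the LSB
    const unsigned int half      = zeroBased / 256u;    // which __m256i holds the bit
    const unsigned int word      = (zeroBased % 256u) / 64u;
    const unsigned int bit       = zeroBased % 64u;

    alignas(32) std::uint64_t mask[4] = {0, 0, 0, 0};
    mask[word] = 1ull << bit;

    result[half] = _mm256_or_si256(result[half],
                                   _mm256_load_si256(reinterpret_cast<const __m256i*>(mask)));
}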

The following example shows how input data is processed through this stage:

Imagine every visual aspect consists of just 4 attributes in a bit set of 1 byte. Imagine also that the last 4 bits are assigned to AT1, AT2, AT3 and AT4, in that order. If we have the following 3 visual aspects:

      VAspect 1   VAspect 2   VAspect 3
AT1   5           2           22
AT2   4           4           100
AT3   0           1           0
AT4   10          99          10

 

We have to do the following comparisons:

VAspect 1 -> VAspect 2

VAspect 1 -> VAspect 3

VAspect 2 -> VAspect 3

 

VAspect 1 -> VAspect 2

Attribute   VAspect 1   VAspect 2   Result
AT1         5           2           0
AT2         4           4           1
AT3         0           1           0
AT4         10          99          0

VAspect 1 -> VAspect 3

Attribute   VAspect 1   VAspect 3   Result
AT1         5           22          0
AT2         4           100         0
AT3         0           0           1
AT4         10          10          1

VAspect 2 -> VAspect 3

Attribute   VAspect 2   VAspect 3   Result
AT1         2           22          0
AT2         4           100         0
AT3         1           0           0
AT4         99          10          0

 

Using the proper bit masks for every bit position and executing the OR operation, we would end up with these arrays:

Visual Aspects compared (pair of ids)   Comparison result (bit set)   Equal attrib. counters
VAspect 1 -> VAspect 2                  00000100                      1
VAspect 1 -> VAspect 3                  00000011                      2
VAspect 2 -> VAspect 3                  00000000                      0

 

In the case of conditional attributes, such as Texture Id which depends on both Texture Stack Size and Component Enabled, three things can happen:

A. When the condition has the same value in both visual aspects:

   a) And Component Enabled is TRUE or Texture Stack Size is greater than the attribute's texture layer: all the dependent attributes are compared.

   b) And Component Enabled is FALSE or Texture Stack Size is not greater than the attribute's texture layer: no dependent attribute is compared. The bit at the position of every dependent attribute is set to 1. The equal attribute counter is increased by the number of affected dependent attributes.

B. When the condition has a different value in each visual aspect: no dependent attribute is compared. The bit at the position of every dependent attribute is set to 0.

Note: If the 512 bits integer equals zero from the beginning it is not necessary to set any bit to zero during the process.
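To make the cases above more concrete, here is a simplified sketch of how the texture layers of one component could be compared for a pair of visual aspects. It reuses the types and the SetComparisonBit helper from the previous sketches, collapses cases A and B into per-layer checks, and assumes a simplistic bit numbering (in the real bit set, positions follow the ordering criterion explained above):

// Compares the texture layer attributes of one component of two visual aspects,
// applying cases A.a, A.b and B. Only Texture Id is shown; the other three
// attributes of each layer would be handled the same way.
void CompareComponentLayers(const VisualAspectComponent& c1,
                            const VisualAspectComponent& c2,
                            unsigned int firstBitOfComponent, // assumed bit position of the first layer attribute
                            __m256i result[2],
                            std::uint16_t& equalCounter)
{
    const unsigned int ATTRIBUTES_PER_LAYER = 4; // Texture Id, Sampler Id, Blend Factor, Blend Operation

    for (unsigned int layer = 0; layer < LAYERS; ++layer)
    {
        const unsigned int firstBit = firstBitOfComponent + layer * ATTRIBUTES_PER_LAYER;
        const bool used1 = c1.ComponentEnabled && layer < c1.TextureStackSize;
        const bool used2 = c2.ComponentEnabled && layer < c2.TextureStackSize;

        if (used1 && used2)            // Case A.a: compare every dependent attribute
        {
            if (c1.Layers[layer].TextureId == c2.Layers[layer].TextureId)
            {
                SetComparisonBit(result, firstBit);
                ++equalCounter;
            }
            // ... Sampler Id, Blend Factor and Blend Operation follow the same pattern
        }
        else if (!used1 && !used2)     // Case A.b: mark all dependent attributes as equal
        {
            for (unsigned int i = 0; i < ATTRIBUTES_PER_LAYER; ++i)
                SetComparisonBit(result, firstBit + i);

            equalCounter += ATTRIBUTES_PER_LAYER;
        }
        // Case B: the conditions differ, so the bits stay at 0 and the counter is not increased
    }
}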

As we will see later, it is necessary to add an extra group of comparisons to the array. This group is reserved for storing pairs whose first Id is the last visual aspect of the input list (“VAspect 3”, in the example) and the second Id is each of the other visual aspects. Due to this addition, this amount of rows must be considered when pre-calculating the size of the arrays (not taken into account in previous calculations in this document). Obviously, the number of extra rows equals the number of different visual aspects – 1.

Every time the algorithm visits a pair whose second Id equals the last visual aspect Id in the original array (“VAspect 3”, in the example), the entire “row” has to be copied to the end of the array, swapping both Ids in the destination pair. This will improve the performance when searching for a pair with a first Id that matches the last visual aspect Id, in the step 4. Remember that all the memory we need has been already reserved so when we say “copy to the end of the array” we are referring to the logical position, i.e. the first empty “row”.

Look at the following example (not related to any other):

Visual Aspects compared (pair of ids)   Comparison result (bit set)   Equal attrib. counters
VAspect 1 -> VAspect 2                  0101010                       3
VAspect 1 -> VAspect 3                  0110101                       4
VAspect 1 -> VAspect 4                  0010101                       3
VAspect 2 -> VAspect 3                  0000001                       1
VAspect 2 -> VAspect 4                  1110010                       4
VAspect 3 -> VAspect 4                  0111000                       3
VAspect 4 -> VAspect 1                  0010101                       3
VAspect 4 -> VAspect 2                  1110010                       4
VAspect 4 -> VAspect 3                  0111000                       3

 

Additionally, in order to optimize the next steps, we have to create an array of unsigned integers as we create comparison pairs; every time the first Id of a pair changes, we add its position index to this new array. Since we added an extra group, as explained in the previous paragraph, the length of this array equals the number of different visual aspects. Thanks to this information, the program knows where every group of comparisons starts and will be able to do fewer comparisons later. Take a look at the example below (not related to any other) for 4 visual aspects (so the array has a length of 4):

Group positions   Pairs of Ids
0                 VAspect 1 -> VAspect 2
                  VAspect 1 -> VAspect 3
                  VAspect 1 -> VAspect 4
3                 VAspect 2 -> VAspect 3
                  VAspect 2 -> VAspect 4
5                 VAspect 3 -> VAspect 4
6                 VAspect 4 -> VAspect 1
                  VAspect 4 -> VAspect 2
                  VAspect 4 -> VAspect 3

 

Note that the first group position will always be zero. We could remove it and save a slot, but that would imply adding some conditions in later steps and could affect performance; it just is not worth it. The value of the slot for the extra group always equals the previous slot's value plus 1.
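A simplified sketch that builds the id pairs and this group position lookup at the same time (it only generates the ids; the actual algorithm fills the extra group's comparison results by copying rows as described above, and the VisualAspectPair type comes from an earlier sketch):

// Builds the list of id pairs and the group position lookup for 'visualAspectCount'
// visual aspects, including the extra group for the last one.
void BuildPairsAndGroupLookup(std::uint32_t visualAspectCount,
                              std::vector<VisualAspectPair>& pairs,
                              std::vector<std::uint32_t>&    groupStart)
{
    groupStart.reserve(visualAspectCount);

    for (std::uint32_t first = 0; first + 1u < visualAspectCount; ++first)
    {
        groupStart.push_back(static_cast<std::uint32_t>(pairs.size())); // group of 'first' starts here

        for (std::uint32_t second = first + 1u; second < visualAspectCount; ++second)
            pairs.push_back({first, second});
    }

    // Extra group: the last visual aspect paired with every other one.
    groupStart.push_back(static_cast<std::uint32_t>(pairs.size()));

    for (std::uint32_t second = 0; second + 1u < visualAspectCount; ++second)
        pairs.push_back({visualAspectCount - 1u, second});
}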

As a summary, these are the formulas to know the number of elements in each of the arrays:

  • Visual aspect Id pairs, Comparison results and Equal attribute counters:

    #VAspects x (#VAspects - 1) / 2 + (#VAspects - 1)

  • Group position index lookup:

    #VAspects

3-    Ordering comparison results by similarity

Once the algorithm reaches this stage, we have all the information we need about how different each visual aspect is with respect to all the others. What we have to do now is to order the array of ids by the number of equal attributes between visual aspects, from greatest to least, inside every group of comparisons (contiguous "rows" where the first Id is the same). Besides, when there is more than one pair with the same number of equal attributes in the same group, they must be ordered by their Comparison result bit set. As will be explained later in this step, we have to calculate a few more things as we traverse the array to order it.

Extending the previous example a bit, we have:

      VAspect 1   VAspect 2   VAspect 3   VAspect 4
AT1   1           5           6           5
AT2   2           2           2           4
AT3   3           3           3           3
AT4   4           5           4           5

 

 

    Visual Aspects compared (pair of ids)   Comparison result (bit set)   Equal attrib. counters
A   VAspect 1 -> VAspect 2                  00000110                      2
B   VAspect 1 -> VAspect 3                  00000111                      3
C   VAspect 1 -> VAspect 4                  00000010                      1
D   VAspect 2 -> VAspect 3                  00000110                      2
E   VAspect 2 -> VAspect 4                  00001011                      3
F   VAspect 3 -> VAspect 4                  00000010                      1
G   VAspect 4 -> VAspect 1                  00000010                      1
H   VAspect 4 -> VAspect 2                  00001011                      3
I   VAspect 4 -> VAspect 3                  00000010                      1

 

So the order would be:

    Visual Aspects compared (pair of ids)   Comparison result (bit set)   Equal attrib. counters
B   VAspect 1 -> VAspect 3                  00000111                      3
A   VAspect 1 -> VAspect 2                  00000110                      2
C   VAspect 1 -> VAspect 4                  00000010                      1
E   VAspect 2 -> VAspect 4                  00001011                      3
D   VAspect 2 -> VAspect 3                  00000110                      2
F   VAspect 3 -> VAspect 4                  00000010                      1
H   VAspect 4 -> VAspect 2                  00001011                      3
G   VAspect 4 -> VAspect 1                  00000010                      1
I   VAspect 4 -> VAspect 3                  00000010                      1

 

In the group of the VAspect 1 comparisons, the comparison B is the first, while in the group of the VAspect 2 comparisons, it is the comparison E.

There are no pairs with the same number of equal attributes and a different bit set in the same group in this example but, if there were, they would be ordered as follows:

 

    Pair of ids              Comparison result (bit set)   Equal attrib. counters
A   VAspect 1 -> VAspect 2   00000011                      10
B   VAspect 1 -> VAspect 3   00001111                      10
C   VAspect 1 -> VAspect 4   00000110                      10

becomes:

    Pair of ids              Comparison result (bit set)   Equal attrib. counters
B   VAspect 1 -> VAspect 3   00001111                      10
C   VAspect 1 -> VAspect 4   00000110                      10
A   VAspect 1 -> VAspect 2   00000011                      10

 

The comparison results should be compared (hail redundancy! :D) using a 256-bit signed integer greater-than comparison. Bad news: there is no such function in either AVX or AVX2; there are only comparison instructions for vectors of smaller signed integers, like _mm256_cmpgt_epi64. We cannot use those due to the same problem mentioned in a previous section: the input 256-bit value is unpacked as four 64-bit signed integers (if we use the function with the -64 suffix), which are compared separately, i.e. each one uses its MSB as a sign bit, and that would generate incorrect results. We cannot use it unless we skip those bits. If we did not use bits #64, #128, #192 and #256 (bit #1 being the LSB), the problem would be avoided. This has no cost; we just do not assign any attribute to those bits. But forget all these explanations; this proposal is not to be used, for performance's sake. We are going to compare the 512-bit bit sets as if they were vectors of 8x64-bit unsigned integers, using a for loop. Take a look at both alternatives:

  • With AVX2
  1. Execute _mm256_cmpgt_epi64, passing both operands as parameters.
  2. Repeat for the second half of both operands (the bit set is formed by 2x256).
  3. Execute _mm256_store_si256 to put the results into signed integers.
  4. Repeat for the second half of both operands.
  5. Cast the resultant bit set to an array of unsigned integers.
  6. Iterate from the most significant integer to the least, comparing each one to the constant 0xFFFFFFFFFFFFFFFF. If it is equal, stop comparing. Eight iterations at most.

  • Without AVX2
  1. Cast the bit set of the first operand to an array of unsigned integers.
  2. Cast the bit set of the second operand to an array of unsigned integers.
  3. Iterate from the most significant integer to the least, comparing both integers to know which is greater than the other. If one value is greater than the other, stop comparing. Eight iterations at most.

 

As you can see, the first one requires 4 instruction executions, casting 1 pointer and up to 8 iterations comparing integers, whereas the second one requires casting 2 pointers and up to 8 iterations comparing integers. Not a big difference, but enough to prefer the solution without AVX2 instructions.
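A minimal sketch of the chosen alternative, comparing two 512-bit bit sets as eight 64-bit unsigned integers (it assumes the most significant word is stored at index 7):

#include <cstdint>

// Returns true if bit set 'a' is greater than bit set 'b', comparing them as
// 8 x 64-bit unsigned integers from the most significant word to the least.
bool IsBitSetGreater(const std::uint64_t a[8], const std::uint64_t b[8])
{
    for (int i = 7; i >= 0; --i)
        if (a[i] != b[i])
            return a[i] > b[i]; // the first differing word decides

    return false; // both bit sets are equal
}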

So now, at the top of every group of comparisons, we have the pair of visual aspects with the fewest differences and whose equal attributes are the most expensive to set and the most likely to change. If an attribute is more likely to change, it appears in a more significant position in the Comparison result bit set and, as we have just seen, the loop that compares the integer array will stop sooner, saving some processing time. That is why condition attributes (Component Enabled and Texture Stack Size) must be in a more significant position.

Optimization note: Use another array to store the position indices of the other three arrays so you order just an array of integers.

At the same time as we traverse the arrays and order them, we have to obtain what will be the first pair to include in the result in step 4. The first pair is the one with the greatest values of Comparison result and Equal attribute counter. We can store the first "row" and then compare each "row" to the stored one as we visit them; if the current "row" is greater, the stored value is replaced with the current one and we keep visiting and comparing. To optimize the process, we can limit the comparison to the moment when the group changes; at that moment, we compare the first "row" of the last group to the stored value. In the example, when we visit F, we are changing from the "VAspect 2" group to the "VAspect 3" group, so we can compare the previous group's first "row", E, to the stored value, which would be B. The result would be E.

Optimization note: This process can be parallelized so every thread orders just one group (the size of the group does not change, the memory region is not accessed by any other thread) and returns a proposal for “first pair”. Then all proposals are compared sequentially and the maximum is chosen.
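Putting the two ordering criteria together, a possible comparator for the "rows" of a group could look like this (the row structure is an assumption; IsBitSetGreater is the sketch shown earlier). std::sort could then be applied to every group independently, or one thread per group as suggested in the note above:

struct ComparisonRow
{
    std::uint32_t        PairIndex;  // index into the pairs / results / counters arrays
    std::uint16_t        EqualCount; // equal attribute counter
    const std::uint64_t* BitSet;     // 512-bit comparison result as 8 x 64-bit words
};

// Orders rows from most similar to least similar: first by the number of equal
// attributes, then by the comparison result bit set (most significant bits first).
bool RowGoesFirst(const ComparisonRow& left, const ComparisonRow& right)
{
    if (left.EqualCount != right.EqualCount)
        return left.EqualCount > right.EqualCount;

    return IsBitSetGreater(left.BitSet, right.BitSet);
}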

 

4-    According to the ordered comparison results, extract the list of visual aspects, putting the ones that have fewer differences contiguously

We have to choose which visual aspect we are going to set first in our engine, which will be followed by the rest of the visual aspects in order of similarity. We cannot know at first sight whether any of them will lead to the optimal sequence, but we do know which of them is the most similar to another one, i.e. which switch yields the biggest saving. The "row" we are looking for is E, as we found in the previous step.

The first time a pair is taken, both visual aspects ids are added to a result array, in the same order. The following times, only the second visual aspect Id is added to the array.

VAspect 2

VAspect 4

 

Once we have taken the first pair, we search for the first pair in the group of the second Id. To find it quickly, we just use the lookup array created in step 2, which contains its index in the main array. When searching in the last group, we have to iterate forward until we find a pair that contains a non-taken Id. In the example, it is the G pair.

VAspect 2

VAspect 4

VAspect 1

 

Finally, the last pair is the B pair in the example.

VAspect 2

VAspect 4

VAspect 1

VAspect 3

 

When the second Id of the next pair already exists in the result array, this stage finishes. As we can see, the next pair would be F, but VAspect 4 already exists in the result array so we ignore it and finish.
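A sketch of this extraction stage under the same assumptions as the previous snippets (visual aspect ids are 0-based indices, 'orderedPairs' is the array ordered in step 3, 'groupStart' is the lookup from step 2, and 'firstPairIndex' is the pair found while ordering):

// Extracts the ordered list of visual aspect ids, starting at the chosen first pair
// and always jumping to the group of the last id added to the result.
std::vector<std::uint32_t> ExtractRenderOrder(const std::vector<VisualAspectPair>& orderedPairs,
                                              const std::vector<std::uint32_t>&    groupStart,
                                              std::uint32_t firstPairIndex,
                                              std::uint32_t visualAspectCount)
{
    std::vector<bool>          taken(visualAspectCount, false);
    std::vector<std::uint32_t> result;
    result.reserve(visualAspectCount);

    // The first pair contributes both of its ids to the result.
    const VisualAspectPair& first = orderedPairs[firstPairIndex];
    result.push_back(first.FirstId);
    result.push_back(first.SecondId);
    taken[first.FirstId] = taken[first.SecondId] = true;

    std::uint32_t current = first.SecondId;
    const std::uint32_t lastAspect = visualAspectCount - 1u;

    while (result.size() < visualAspectCount)
    {
        std::uint32_t i = groupStart[current]; // first pair of the current id's group

        // In the last group, iterate forward until a non-taken id is found.
        while (current == lastAspect && i < orderedPairs.size() && taken[orderedPairs[i].SecondId])
            ++i;

        if (i >= orderedPairs.size() || taken[orderedPairs[i].SecondId])
            break; // the next candidate already exists in the result: the stage finishes

        current = orderedPairs[i].SecondId;
        taken[current] = true;
        result.push_back(current);
    }

    return result;
}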

We can now make an estimation of how many state changes we have saved. It always depends on the input list of visual aspects, of course; in some cases we will gain a lot of performance (mostly when the input list is big), and in some other cases we may not gain any (surely when the input list is small). The worst case that implies setting all the states of every visual aspect is 16 (4 attributes for 4 visual aspects) in our example. Thanks to this algorithm we will only need to change 9: 4 for the first visual aspect and 5 when switching to the rest.  Almost 50% of the state changes will be avoided in this case. Please note that, if we obtained the same percentage of similar attributes for a larger list of visual aspects, the first 4 state changes would be relatively marginal and the result would be much better than 50%.

 

    Comparison               Result     Equal attrib.   States to change
E   VAspect 2 -> VAspect 4   00001011   3               1
G   VAspect 4 -> VAspect 1   00000010   1               3
B   VAspect 1 -> VAspect 3   00000111   3               1

 

 

An alternative implementation for Steps 3 and 4

If we pay attention to step 2, what we are really doing is creating a weighted undirected complete graph whose vertices are visual aspects and whose edges are visual aspect comparisons. Once we have that graph, we try to find the longest path that traverses all the vertices, without cycles, no matter which are the start and end points; the distances used to calculate the path's length are the Equal attributes counter and the Comparison result bit set.

 

This problem is commonly known as the Traveling Salesman Problem (TSP), symmetric in this case. There are many ways to solve it, from exact algorithms to heuristic or approximation algorithms. Exact algorithms give us the best result but are too slow; their running time ranges from O(n!) to O(n^2 * 2^n), where n is the number of visual aspects. Approximation algorithms like Ant Colony Optimization or Christofides', among others, are faster in exchange for obtaining a result that may not be the best one but most of the time is very close. There is also a very fast 2-approximation algorithm (the result may be at most twice as long as the optimal one) consisting of creating a minimum spanning tree (MST) with Prim's algorithm starting at an arbitrary vertex, and then visiting the tree's nodes in a preorder walk. We would have to change the comparison operation, since we need the longest path instead of the shortest. Finally, just to mention it, Floyd-Warshall is another algorithm that could help us get the shortest Hamiltonian path (i.e. one that visits all vertices) in the graph if we select 2 vertices as end points (for example, the two with the fewest differences with respect to their neighbors); however, it seems to be much slower than the previously described algorithm.
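A sketch of proposal B under the assumptions of this post: a "maximum" spanning tree built with Prim's algorithm over the similarity weights (the equal attribute counters), followed by a preorder walk. The tie-breaking by Comparison result bit set is omitted for brevity, and the root would be the visual aspect with the most expensive edges, chosen beforehand:

#include <cstdint>
#include <vector>

// weight[i][j] = number of equal attributes between visual aspects i and j.
// Builds a maximum spanning tree with Prim's algorithm starting at 'root' and
// returns the visual aspect order produced by a preorder walk of that tree.
std::vector<std::uint32_t> OrderByMaximumSpanningTree(const std::vector<std::vector<std::uint16_t>>& weight,
                                                      std::uint32_t root)
{
    const std::uint32_t n = static_cast<std::uint32_t>(weight.size());
    std::vector<bool>          inTree(n, false);
    std::vector<std::uint32_t> parent(n, root);
    std::vector<std::int32_t>  best(n, -1); // best edge weight connecting each vertex to the tree

    inTree[root] = true;
    for (std::uint32_t v = 0; v < n; ++v)
        if (v != root)
            best[v] = weight[root][v];

    std::vector<std::vector<std::uint32_t>> children(n);

    for (std::uint32_t added = 1; added < n; ++added)
    {
        std::uint32_t next = n;
        for (std::uint32_t v = 0; v < n; ++v) // pick the heaviest edge towards the tree
            if (!inTree[v] && (next == n || best[v] > best[next]))
                next = v;

        inTree[next] = true;
        children[parent[next]].push_back(next);

        for (std::uint32_t v = 0; v < n; ++v) // update the candidate edges
            if (!inTree[v] && weight[next][v] > best[v])
            {
                best[v]   = weight[next][v];
                parent[v] = next;
            }
    }

    // Preorder walk of the tree, using an explicit stack.
    std::vector<std::uint32_t> order;
    order.reserve(n);
    std::vector<std::uint32_t> stack(1, root);

    while (!stack.empty())
    {
        const std::uint32_t v = stack.back();
        stack.pop_back();
        order.push_back(v);

        for (auto it = children[v].rbegin(); it != children[v].rend(); ++it)
            stack.push_back(*it); // reversed, so children are visited in insertion order
    }

    return order;
}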

In our example, if we apply the mentioned 2-approximation algorithm, using the Equal attributes counter value as edge size, the result would be:

The "VA2" vertex was chosen as the first one because it is the one with the longest edges. As an addition to the algorithm, every time two edges have the same length, the one with the greatest Comparison result bit set is chosen. We can see that we obtain the same result as with the other alternative; this is just a coincidence, because the number of visual aspects is so small. We are saving 7 state changes out of 16.

 

Conclusion

After some tests on paper with more visual aspects, it seems that the TSP approach returns better results. It is too soon to be sure of which alternative is better in terms of performance. Both alternatives will be implemented and measured; steps 1 and 2 are mandatory whichever proposal we use, although some parts may be skipped for the TSP. Besides, it is important to compare the performance with and without this optimization algorithm to be sure it does not have the opposite effect (it might happen that executing this algorithm is more expensive than setting all the graphics device's states); this will surely happen when the number of visual aspects is small, so the algorithm could be disabled below a certain number of visual aspects to avoid it. Although several optimization techniques have been mentioned throughout the document, more effort has to be put into optimizing the code for speed, even if it means increasing memory consumption (the typical tradeoff), since this algorithm has to be executed once per frame and there is more processing for the engine to do before rendering.

Phase 3 ends

After thousands of battles and hundreds of working hours throughout almost 3 months, development Phase 3 of Quimera Engine comes to an end. Many new features have been added, most of them related to operating system abstraction and usability improvements on existing components; compilation times and file sizes, along with the design in general, have been improved. Progress is described in detail below. I'm very glad I could perform many of the improvements I had postponed since the beginning. Now the engine as a whole is both more robust and more usable, although there is still a lot of work pending.

This stage has been a bit special: it's been the first one I've developed alone from start to finish. It's been interesting due to the possibility of comparing the way of working and the amount of progress, the pros (like being able to change the design quickly) and the cons (like having to review your own code or not being able to ask for opinions). I'm not going to talk in this post about the reasons that made the team shrink from up to 7 people to 1, nor about my experience managing a team of people who didn't know each other and were dispersed all over the country, each one with their own personal problems and life goals, in a barely incipient project with an abstract future; someday I'll write about that to share it with those who might be interested. Apart from the errors any of us committed, some of them avoidable and some of them inevitable, I keep everything I've learned and have confirmed that there are more people out there who seek to fulfill themselves and are capable of enlisting in crazy projects like this. It's a comforting thought for me. "If you want to walk fast, walk alone. If you want to walk far, walk together", some people say; my intention in the beginning was to walk far, so I looked for company; what that phrase does not tell you is "...and prepare provisions to feed your company during the entire trip, an official map, a compass and a beautiful postcard from the place you are going to, otherwise only Dora the Explorer will follow you". Looking at it with some perspective, it doesn't discourage me: I think we went a long way.

Now that I'm walking alone, it's time to change the way I do things and embrace the advantages this implies regarding agility. I already changed the road map in the past, postponed the development of certain less urgent components of the Common and Tools layers and brought forward others required to develop higher layers; it's time to jump to the most interesting and satisfying part of all: graphics. The next phase will consist of the creation of a graphics engine prototype based on OpenGL that will later be integrated into Quimera Engine. It will be a proof of concept, not a "proof of production" (somebody will understand what I mean); once it implements some minimum functionality, a clean version will be rewritten, a user interface will be designed and it will be joined to the rest of the project as a plug-in. I'll provide more details in later posts.

To the point

Here are some numbers extracted from the project, as was done at the end of the previous phase:

  • 274,856* lines of code, without either blank lines or file header comments, counting the test code.
  • 84,544* lines of code, without either blank lines or file header comments, not counting the test code.
  • 35,000 lines of code, approximately, written during this phase.
  • 461* source code files, counting the test code files.
  • 276* source code files, not counting the test code files.
  • 196* classes, 40 of them created during this phase.
  • 43%* of the code lines (without tests) are internal documentation or public documentation.
  • More than 7,750 unit tests in total, 650 of them written during this phase.
  • 155 revisions in Subversion during this phase (the code is only uploaded when a task is completely done, including unit tests, and after it has passed review).
  • 85 tasks completed during this phase.
  • More than 542 tasks finished in total so far, out of 702 created in the tracker.

*These numbers were obtained using the Source Monitor tool.

Summary of the finished work

New classes

  • Call stack tracing tools: QCallTrace, QCallStackTrace, QAbstractCallStackTracePrinter, IQCallStackTraceFormatter, QCallStackTracePlainTextFormatter, QCallStackTraceConsolePrinter, QCallStackTracer, QScopedCallTraceNotifier, QTypeWithToString, QTypeWithGetObjectType, and macros to make the use of this internal mechanism easier.
  • Iterators QConstHashtableIterator, QConstDictionaryIterator.
  • Timing: QStopwatchEnclosed.
  • File system: SQFile, SQDirectory, QFileStream.
  • IO: QBinaryStreamWriter, QTextStreamReader, QTextStreamWriter.
  • Data types: QArrayBasic, QArrayResult.
  • Containers: QHashtable, QDictionary.
  • Container components: QKeyValuePair, SQStringHashProvider, SQIntegerHashProvider, SQKeyValuePairComparator, SQNoComparator, SQEqualityComparator.
  • Threading: QThread, SQThisThread, QMutex, QSharedMutex, QRecursiveMutex, QScopedExclusiveLock, QScopedSharedLock, QConditionVariable, QScopedLockPair.
  • Others: QEvent, QReferenceWrapper, QAssertException.

Improvements

  • QDynamicArray and QFixedArray were renamed to QArrayDynamic and QArrayFixed, respectively. This makes them more discoverable, since the user expects arrays to be called "Array" + something. Although it may seem a simple decision, it wasn't, because it also has negative consequences: we often use the "control+spacebar" combination for the contextual help to complete the name and then press "Enter" to continue writing; there are 4 classes whose name starts with "QArray" now, so it is necessary to type all those letters before we can choose the class we want, whereas before it was enough to write "QDy" or "QFix" to obtain the full name. The readability improvement is more subjective in this case, since reading the words in that order is less intuitive for English speakers.
  • Enumerated types refactoring. The dependency on STL containers has been removed (the intention is not to depend on the STL at all in the whole project). Now enumeration classes are simpler, faster and take less time to compile. Besides, they no longer necessarily depend on the string_q type, although they can; this allows the string_q type to include in its own header file the enumerations it needs, so the user doesn't need to worry about including them every time they are needed (obviously, this has a small impact on the compilation time that, apparently, is worth it).
  • Basic data type to string conversion methods refactoring. There are specific static classes for every basic data type, that is, SQInteger for integers, SQFloat for floating point types, SQBoolean for booleans and SQVF32 for 128-bit vectors. So far, every class provided a ToString method to convert the corresponding type to its representation as a character string. Now all that functionality has been moved to the QStringUnicode class as overloads of operator+ and Append, and From*** static methods. This way, 2 things are achieved: first, it is not necessary to include the QStringUnicode class if only basic data type operations are needed; second, concatenating variables is much easier and more readable, not having to write method calls when, for example, one wants to log a message. E.g.: QE_LOG(myString + 5 + " + " + 3 + " = " + (5 + 3)), where myString contains "Sum: ", would result in "Sum: 5 + 3 = 8".
  • Internal RTTI system refactoring. The internal RTTI system allows a remarkable reduction of the space required to store information about the data types used in the project, letting us know and compare some types even dynamically, when the integrated classes are polymorphic. So far the intention was to implement this in a simple manner, forcing all the classes integrated into the mechanism to inherit, directly or indirectly, from the QObject root class. This solution required the use of virtual inheritance to break deadly diamonds which, apart from any discussion about its usefulness or its danger, prevented the correct resolution of polymorphic types as it was implemented. Now QObject is no more (as well as SQTypeExtensions) and macros are used in the base classes instead.
  • Small improvement of QDateTime's interface. So far it wasn't necessary to specify a time zone to create dates and times, using UTC by default, that is, setting the value of the argument corresponding to the pointer to QTimeZone to null. This caused the constructor overloads to overlap due to the implicit conversion from integer to pointer that takes place when the value is zero. To solve such ambiguity, the time zone will now always be mandatory. It's a usability loss in exchange for avoiding potential serious problems in the client code.
  • Small usability improvement of QStringUnicode. Contrary to the previous case, some default values have been added in some methods for them to be easier to use. For example, in the past it was necessary to provide the comparison type to be used when calling Contains, and now it is understood to be binary and case-sensitive (the fastest and the most frequent) so we can, simply, write something like myString.Contains("myPattern").
  • Multi-process compilation in Visual Studio. Now all the projects seem to compile in half the time.
  • Exception handling and RTTI support deactivation. Quimera Engine uses its own RTTI system, lighter and faster, only in the parts where it provides some value. Regarding exception handling, the only exception thrown by the engine is the one thrown by the asserts when they are configured that way, solely for testing purposes. It's accepted as a rule of thumb that a videogame either works perfectly or doesn't work. Unexpected errors (those not found during development) will be traced, when possible, and the program will terminate. All the reasons for this decision are written in Quimera Engine's Design Foundations (currently out of date, by the way). The Boost libraries had to be recompiled with the corresponding options, and some header files had to be fixed in situ in order to make them compile, until a better solution is found. As an important side effect, the size of all the libraries has been reduced considerably, as can be seen in the comparative tables below.
  • Improvement of machine endianness detection utilities. Some definitions for detecting whether the machine is either Little-endian or Big-endian at compile time have been added. Some functions to check the same at runtime were added too.
  • Migration of the Test Automation Tool from Code::Blocks to CodeLite and port to Linux and Mac OS X. So far, the test automation tool created ad-hoc for executing the Quimera Engine's unit tests was mounted on a Code::Blocks project and had never worked on Mac. Now it's mounted on CodeLite (Code::Blocks is not used anymore for anything) and works on all the supported operating systems.
  • Creation of a tool for converting CodeLite workspaces to makefiles automatically. Updating the makefiles of all the projects every time a source code file was added or a compiler option was changed was a pain in the neck and, most of all, a risk. In order to avoid that, I created a small tool in C# (fast development, no portability needed) which transforms CodeLite projects and workspaces into ready-to-execute makefiles with a couple of mouse clicks. The time saved is considerable. More details can be found below.
  • General warnings cleaning. They are like bad weeds, they must be kept at bay. Warnings lose their usefulness if they are not kept at zero, when possible. Besides, it's necessary to prevent them from propagating to the library user's project. This happens, just recalling a recent example, with Boost, which forces you to include its libraries as if they were system libraries to avoid a torrent of third party warnings infesting the build log.
  • Support for SIMD instructions and types added to projects. Now when compiling with GCC/Clang too.
  • Partial optimization of QArrayFixed and QArrayDynamic. As I anticipated in the previous end-of-phase post, I've performed some optimizations in the container classes. The most affected class is QArrayFixed, which now has the same efficiency as std::vector when iterating (in the end it boils down to an integer increment and a dereference to obtain the actual value). I haven't dedicated much time to performance experiments, which I would love to do, but I did some concrete tests and I can say that QList is faster than std::list, for example. I hope I have time to post a demo in the future.
  • Usability improvement of methods that return arrays. Some methods, like QStringUnicode::Split, return multiple results whose amount is unknown. The typical solution is to pass a pointer and an integer as output arguments. To improve both usability and readability, I decided to return a small structure which contains both the number of elements and the array, whose memory has been allocated inside the method. The structure is attached to the content, so when its destructor is called it will free the memory occupied by the array, and the user doesn't have to worry about that. In order to avoid creating something similar to unique_ptr (with the problems it would imply), it's been made clear that the usage of such a structure is limited exclusively to returning this kind of result, hence its name: QArrayResult. The structure can be detached from the content manually, calling Detach. The addition of overloads that receive a pointer (the same one to be returned in the structure) to give the possibility of providing a preallocated memory block, increasing the performance, is not discarded.

Pending work

Classes

  • Threading: QConditionVariableExclusive.
  • Hashed strings: SQStringPool, QHashedString

Improvements

  • Mechanism to enable/disable assertions by type (error, warning or deprecation).
  • Memory allocation tracing.
  • Quit STL dependency in QLocalTimezone.
  • Conversion of some container's methods to templates, so they can interact with each other.
  • Add ForEach methods to containers to execute a function per element.
  • Add low priority methods to many of the existing classes.
  • Research on whether it is possible to replace time zone libraries of Boost with ICU's, which seem to be more complete.

 Comparative tables of libraries sizes when compiled with and without both RTTI and exceptions

The following 2 tables show the difference among the libraries before and after the changes. The explanation of every table appears under each one.

  Before After Reduction Total
QuimeraEngineCommon.dll 554 kb 447 kb 20%  
QuimeraEngineTools.dll 2068 kb 1741 kb 16%  
QuimeraEngineSystem.dll 1308 kb 1070 kb 18% 17%
QuimeraEngineCommon.lib 4635 kb 3660 kb 21%  
QuimeraEngineTools.lib 12100 kb 9756 kb 20%  
QuimeraEngineSystem.lib 14254 kb 9728 kb 32% 26%

Compiled with Visual Studio 2010, DebugWin32Sharedrt configuration. The results of this first table are not completely reliable because, in order to make the libraries compile with the new configuration, some changes were made to the code, which may affect the final size. However, the differences are significant enough to assert that both deactivations produced smaller binaries.

  Before After Reduction Total
QuimeraEngineCommon.dll 315 kb 259 kb 18%  
QuimeraEngineTools.dll 1838 kb 1558 kb 15%  
QuimeraEngineSystem.dll 1015 kb 842 kb 17% 16%
QuimeraEngineCommon.lib 2188 kb 1778 kb 19%  
QuimeraEngineTools.lib 9156 kb 7368 kb 20%  
QuimeraEngineSystem.lib 10700 kb 6971 kb 35% 27%


Compiled with Visual Studio 2010, DebugWin32Sharedrt configuration. This table shows, some time later, the reduction due only to the deactivation of exceptions, not modifying any code between both samples. Up to 35% depending on the file.

Between the moment I started working on the task and the moment I finished it, the total size decreased by around 25-30%.

CodeLite projects to Makefiles converter

This is a brief presentation of the little tool created to generate totally functional makefiles from CodeLite IDE 6.01 workspace and projects.

1. The workspace is selected:

2. When selected, all its content is loaded into the screen, allowing the visualization of every compilation configuration (using the combo box):

3. Makefiles are generated for every configuration of all the projects in the workspace:

 


Links to the project's Wiki for more information

API reference documentation

State of the project, detailed

Road map

Proposals table

P.S.: I have to learn to ration the information across several posts, otherwise it is impossible to read.
