
Generating the Models

  A basic assumption of the modelling process is that dependencies can be inferred quite well from the activities of the services, as collected in the data of step 5.

A straightforward method to determine the dependencies is to choose data that expresses this kind of information directly, such as usage entries in log files. Step 5 would then consist of extracting the information from the relevant files.

However, a major drawback of this approach is that log files typically have proprietary formats, which sometimes even change between software versions. Even worse, not all applications provide log files containing this information, or access to them may be restricted for other reasons, e.g. security policies or a limited amount of local disk space.

The suggested solution is therefore to concentrate on information which is relatively easy to collect and available for all types of services and applications.

Examples of such measurable values that allow conclusions to be drawn about the services' activities are:

Generally speaking, this is information taken from lower layers, such as the operating system, the middleware or the transport system.

Of course, this information does not show the dependencies explicitly. The fact that two services show activity at the same time does not by itself justify the conclusion that they are dependent, but after observing such coinciding behaviour several times (over a certain period of time), the conclusion becomes plausible.

This is where methods from the field of neural networks can bring their advantages to bear, such as:

and others also described in [gko].

In this case, a neural network is used to determine whether two real-world objects have a relationship or not. This is achieved by training the network with data collected from the real environment for which the results (whether dependencies between the objects exist or not) are known. Examples are needed for both cases.
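
As a minimal sketch of this classification step (not the authors' implementation): the features, the use of scikit-learn's MLPClassifier and the synthetic stand-in data are assumptions made purely for illustration. Pairs of activity series with known labels (dependent or unrelated) are reduced to feature vectors and used to train a small network:

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def pair_features(a, b):
    # simple features describing how two activity series co-vary
    corr = np.corrcoef(a, b)[0, 1]
    coincident = np.mean((a > a.mean()) & (b > b.mean()))  # simultaneous activity
    return [corr, coincident]

def synthetic_pair(dependent):
    # synthetic stand-in for two measured activity series
    a = rng.poisson(3.0, 200).astype(float)
    b = a + rng.normal(0.0, 1.0, 200) if dependent else rng.poisson(3.0, 200).astype(float)
    return pair_features(a, b)

X = np.array([synthetic_pair(d) for d in [True] * 50 + [False] * 50])
y = np.array([1] * 50 + [0] * 50)   # 1 = known dependency, 0 = unrelated

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X, y)
# for an unseen pair of series u, v: net.predict([pair_features(u, v)])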

To achieve good quality, the training set must contain data from at least two distinct ``service implementation - service user'' dependencies as well as from pairs of unrelated services. Each of them must be observed under various usage conditions and during times of high and low utilisation. Usually this is the case for data from selected services in real environments, collected over a longer period of time, e.g. a few days including some hours during the night and at the weekend.

While the neural network is in use, it may be improved further using reinforcement learning techniques.

Using data from real environments leads to the problem of noisy training data, but the neural network's ability to generalise allows it to cope with this. Furthermore, designing and building a dedicated test environment thereby becomes unnecessary -- this would even be impossible when services that are hard to set up have to be modelled.

Figure [*] shows two plots of data collected from two hosts during the same time period. The values shown represent the intensity of each host's IP communication with others during time intervals of five seconds.


  
Figure: Example plots of network activities of two hosts

Of special interest within the plots are the high spikes. At three time intervals (labelled 1785, 1801 and 1826 for the first host and 1784, 1800 and 1825 for the second) both hosts show activity of nearly the same intensity, indicating a possible relationship. The plot of host one additionally shows activity at other times (at intervals 1794 and 1837), which is just noise for the investigation of the two hosts' relationship.
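
The following sketch illustrates, under assumed parameters (five-second bins, a spike threshold of several times the mean), how such activity series could be binned and how coincident spikes of two hosts could be flagged; the function names and the threshold are hypothetical:

import numpy as np

def bin_activity(timestamps, volumes, start, end, width=5.0):
    # sum the traffic volume observed per interval of `width` seconds
    edges = np.arange(start, end + width, width)
    binned, _ = np.histogram(timestamps, bins=edges, weights=volumes)
    return binned

def coincident_spikes(series_a, series_b, factor=5.0):
    # indices of intervals in which both hosts clearly exceed their mean activity
    spikes_a = series_a > factor * series_a.mean()
    spikes_b = series_b > factor * series_b.mean()
    return np.where(spikes_a & spikes_b)[0]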

In the general case, similar data expressing activity must be selected, as described in the previous section, for each object implementing a service. If several objects have to be merged in the final models, their data has to be merged accordingly. This can be done simply by assessing them (assigning factors) and summing up the weighted values.
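
A minimal sketch of such a weighted merge, with the assessment factors assumed as given, could look as follows:

import numpy as np

def merge_activity(series_list, factors):
    # weight each object's (equally binned) activity series and sum them up
    series = np.asarray(series_list, dtype=float)
    weights = np.asarray(factors, dtype=float).reshape(-1, 1)
    return (weights * series).sum(axis=0)

# e.g. two processes implementing one service, the first assessed higher:
# merged = merge_activity([proc1_series, proc2_series], factors=[0.7, 0.3])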

One problem of this method needs further investigation: in the generation of real-world models, of course, more than just two objects are involved. To test all possible relationships between n objects, O(n²) tests are necessary.

On the one hand, this argument supports the use of neural networks, since, once trained, they can perform this task faster than traditional correlation analyses. On the other hand, the effort still remains a problem for large n.

If, in the end, only the abstract models are needed, it is possible to restrict the modelling process to a very small number of implementations per service. To obtain complete real-world models, other restrictions have to be applied. One possibility is to preselect pairs of objects which surely cannot depend on each other; e.g. it is not necessary to test whether two web clients depend on each other. Such exceptions are easy to specify and significantly reduce the number of dependencies that have to be investigated.
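
One possible form of such pre-selection is sketched below, assuming each object carries a role attribute (e.g. ``client''); the rule shown follows the web-client example above:

from itertools import combinations

def candidate_pairs(objects, role_of):
    # enumerate object pairs, skipping those that surely cannot depend on each other
    for a, b in combinations(objects, 2):          # O(n^2) pairs in the worst case
        if role_of[a] == "client" and role_of[b] == "client":
            continue                               # e.g. two web clients never depend on each other
        yield (a, b)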

Another way is to divide the environment to be modelled into smaller areas, e.g. administrative zones, or according to topological aspects. So that these areas need not remain completely isolated, special objects can be added to each of them to represent connections to the outside. This also helps to find the right partitioning: if too many dependencies to these boundary objects exist, it is helpful to add further objects to the area; objects with no (or very few) dependencies within the area are good candidates to be moved outside.
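
A possible sketch of this partitioning, assuming an object-to-zone assignment is available and representing the outside of each zone by a placeholder object, could look as follows; the placeholder naming is hypothetical:

from collections import defaultdict

def partition(objects, zone_of):
    # group objects by zone and add one placeholder per zone for connections to the outside
    zones = defaultdict(list)
    for obj in objects:
        zones[zone_of[obj]].append(obj)
    for zone in zones:
        zones[zone].append("external::" + zone)    # hypothetical name for the boundary object
    return zones

# Dependency tests are then run only within each zone; many dependencies to the
# "external" object suggest that the partitioning should be revised (see text).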

