As the scale of software systems continues to expand, software architecture is receiving increasing attention as the blueprint of complex software systems. An outstanding architecture requires a great deal of professional experience and expertise. In current practice, architects search for solutions manually, which is time-consuming and error-prone because of the knowledge barrier between newcomers and experienced architects. The problem can be mitigated by easing the process of applying the experience of prominent architects. To this end, this paper proposes a novel graph-embedding-based method, AI-CTO, to automatically suggest software stack solutions according to the knowledge and experience of prominent architects. Firstly, AI-CTO converts existing industry experience into knowledge, i.e., a knowledge graph. Secondly, the knowledge graph is embedded into a low-dimensional vector space. Then, the entity vectors are used to predict valuable software stack solutions with an SVM model. We evaluate AI-CTO with two case studies and compare its solutions with the software stacks of large companies. The experiment results show that AI-CTO finds effective and correct stack solutions and outperforms baseline methods.
Keywords: Knowledge graph; graph embedding; software architecture
As the scale of software systems continues to expand, software architecture is receiving increasing attention as the blueprint of complex software systems. Software architecture establishes the link between requirements and implementation.
Motivated by the above problem and challenge, this paper proposes a novel method, AI-CTO, to suggest software stack solutions. AI-CTO consists of three stages: establishing the software knowledge graph, embedding the knowledge graph, and deriving software stack solutions. The basic idea is extracting knowledge from a well-designed software graph to facilitate architecture tasks. As the relations among software entities can be reasonably represented by a knowledge graph, we first build a five-layer software graph comprising the software system, software stack, software category, software label and software framework layers.
For automated analysis of the architectural problem, we embed the entities in the software graph into a low-dimensional vector space. However, most previous research on graph embedding learns only from structural triples. We therefore combine structure-based embedding with an encoding of the description text of entities.
To derive the software stack solutions according to the requirements of architects, we propose a requirement-based-walk method to select a set of stack solutions, which satisfy the requirements. These preliminary stack solutions are further filtered by a Support Vector Machine (SVM) model. The feature vector of each stack is calculated by the embedding results. The SVM model is trained by the stack solutions of large companies.
According to the above ideas, we build a prototype of AI-CTO. The software graph contains 11876 entities and 43269 relations, including 3175 software entities and 350 company entities. We compare against three baselines in two experiments to evaluate the correctness and usage of AI-CTO. For the correctness experiment, the results of AI-CTO are verified against real software stacks. For the usage experiment, we count the companies using the same stacks as the AI-CTO results. The results show that AI-CTO outperforms the baselines and can suggest satisfactory software stacks.
In summary, we make the following contributions in this paper:
- We introduce the concept of the knowledge graph to formally represent software entities and the requirements of architects, which converts development experience into knowledge and narrows the knowledge barrier between newcomers and experienced architects.
- We implement a prototype of AI-CTO to extract effective software stack solutions from the software knowledge graph. The method makes full use of both the semantic information of software descriptions and graph structure information. The software stack solutions are further filtered by an SVM model.
- We present an extensive evaluation of AI-CTO. The experiment results show that AI-CTO outperforms the baseline methods.
The rest of this paper is organised as follows. Section 2 provides the background needed to understand the technologies used in our method. Section 3 details the software stack solution method. The method is analysed and evaluated in Section 4. Section 5 reports factors that may affect the experiment results. Section 6 discusses related work. Finally, we summarise the paper in Section 7.
This section discusses the motivation, the formalisation of the problem as a selection space, and notions of knowledge graphs.
There is a phenomenon of inadequate utilisation of existing knowledge and experience: newcomers possess limited knowledge and experience, while the accumulated experience of prominent architects is difficult for them to access and apply.
In addition, the knowledge and experience of architects are full of entities and relations, such as software, companies and dependencies among them, which is very similar to the concept of the knowledge graph. To facilitate the analysis in form of vectors, the graph is embedded into a continuous low-dimensional vector space.
The embedding process takes two kinds of features into account. The graph structure features reflect spatial characteristics, while the description features reflect semantic information. An SVM model is then used to find the boundary between valuable and worthless stacks based on the combined features.
This problem can be explained by defining a selection space. Let $a$ be a vector with elements $a_1, a_2, \ldots, a_n$, where each element $a_i$ denotes a software category required by the target system, and let $C_i$ be the set of candidate tools for category $a_i$. The selection space is the Cartesian product

$$S = C_1 \times C_2 \times \cdots \times C_n \quad (1)$$

The task for architects is choosing an option from the selection space. For example, the architecture of a simple web application usually consists of a web server, a client and a database, i.e., the vector $a$ has three elements. According to a technology stack website, each category can offer dozens of candidate tools, so the selection space quickly becomes too large to explore manually.
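The multiplicative growth of the selection space can be illustrated with a short sketch; the candidate counts per category below are hypothetical, not taken from the paper's data:

```python
from math import prod

# Hypothetical numbers of candidate tools per required category for
# a simple web application (web server, client framework, database).
candidates = {"web_server": 30, "client_framework": 50, "database": 40}

# A stack picks exactly one tool from each category, so the selection
# space is the Cartesian product of the candidate sets.
space_size = prod(candidates.values())
print(space_size)  # 30 * 50 * 40 = 60000
```

Even three modest categories already yield tens of thousands of possible stacks, which is why manual exploration does not scale.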
The knowledge graph efficiently stores objective knowledge in the form of triples. Each triple contains two entities and the relation between them: a triple (h, r, t) contains a head node h, a relation r and a tail node t. This kind of knowledge representation preferably reflects the relational information between entities and is useful in various domains.
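The (h, r, t) representation can be sketched directly in code; the example triples below reuse entity names that appear later in the paper, but the indexing helper is only an illustration:

```python
from collections import defaultdict

# Knowledge-graph triples (h, r, t): head entity, relation, tail entity.
triples = [
    ("NodeJS", "has_label", "high-performance"),
    ("Facebook", "uses", "React"),
    ("React", "belongs_to", "Javascript UI Libraries"),
]

# Simple adjacency index: all (relation, tail) pairs for a head entity.
index = defaultdict(list)
for h, r, t in triples:
    index[h].append((r, t))

print(index["NodeJS"])  # [('has_label', 'high-performance')]
```

Such an index supports the neighbourhood queries that graph construction and walking rely on.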
A domain knowledge graph (DKG) is a knowledge graph focused on a specific field, such as software engineering. A software knowledge graph contains not only software but also related entities such as developers, logs and documentation. For example, IntelliDE aggregates heterogeneous software engineering data into a software knowledge graph to support intelligent development.
The software knowledge graph can be used to address different issues in software engineering, such as the design and analysis of functional requirements.
Fig. 1 A layout example of the software knowledge graph.
Fig. 2 The overall architecture of our method.
Fig. 2 shows the overall architecture of the proposed approach, which contains three main stages.
The preprocessing stage converts raw data into the knowledge graph. The graph construction process extracts entities and relations from raw data to build a structured software knowledge graph. The raw data come from online sources and consist of software tools, companies, software labels, etc., and are massive and disordered. Therefore, the data are cleaned in the data storage process, and the output of the graph construction process is stored into the graph database in this process.
In the embedding stage, the software knowledge graph is projected into a continuous low-dimensional vector space. Two kinds of information can be used for the embedding: graph structure information and auxiliary information about entities. The structure-based embedding process projects entities into the vector space based on graph structure information, while the description encoder process encodes the descriptions of entities into vectors of the same dimension based on the auxiliary information. The two kinds of vectors are combined to represent each entity in the software knowledge graph.
The stacking stage constructs software stack solutions. The requirement based walk process walks in the graph to select a primary stack according to user requirements. The result filtering process improves the primary stack according to entity vectors.
One of our central hypotheses is that the technology stacks used by famous companies are efficient and adaptable. Therefore, we crawl technology stack data from stackshare, a website where companies share the technology stacks they use, as elementary knowledge.
Although the graph data is generated from the relations between entities, it still needs a carefully designed structure. The extracted entities and relations are organised into five layers, as shown in Fig. 1. Each layer is created for one kind of entity. In particular, the label layer reflects the requirements of users, so a stack whose walk passes through a label can be regarded as satisfying the corresponding requirement.
The software system layer contains the target system or application for which we want to build a stack, such as a Web application. The following paragraphs describe the structure and features of the remaining layers.
The software stack layer consists of four entities, "Application&Data", "Utility", "Devops" and "Business tools". The four architecture items are used to organise a large number of software categories.
The software category layer consists of various categories of basic software items, such as databases, cloud hosting and full-stack frameworks. The software stack is built according to different categories in this layer. For example, a Web application basically contains front-end framework, Web server and database. The stack for a Web application chooses one element from each of these three categories.
The software label layer consists of function and performance labels, which reflect the characteristics of elementary software items and represent the requirements of users. For example, developers tend to use high-performance tools such as NodeJS; "high-performance" is a performance label of NodeJS and represents the developers' requirement on performance. As the category layer and the software framework layer are connected with the label layer, software that satisfies multiple requirements can be selected by walking in the graph.
The software framework layer consists of elementary software items, such as NodeJS, JavaScript and Python. In particular, software items used by famous companies are connected to the corresponding company entities. This makes the stacks used by famous companies distinguishable from others in the embedded data, so our method can learn the technology features of famous companies.
The output data of the graph construction process is imported into the graph database for succeeding tasks. According to the features of the graph, we first build entities in the graph database and then associate them with relations. However, this method requires too many query operations and is inefficient, so we optimised the import: when building entities, entities whose relations are already known are directly constructed into triples, which reduces the query operations when relations are built later. In addition, the software label data from stackshare is submitted by tool users, i.e., it is crowdsourced data. Not all labels in the data are valid; some lack support from developers. Only the top 60% of labels are retained.
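The 60% label cut-off can be sketched as a simple rank-and-truncate step; the label names and vote counts below are hypothetical stand-ins for the crowdsourced stackshare data:

```python
# Hypothetical crowd-sourced labels with their developer vote counts.
labels = {"high-performance": 843, "free": 1500, "simple": 322,
          "cross-browser": 1300, "npm": 1300}

# Keep the top 60% of labels ranked by votes; drop the weakly
# supported tail, as described for the stackshare label data.
ranked = sorted(labels.items(), key=lambda kv: kv[1], reverse=True)
keep = ranked[:max(1, int(len(ranked) * 0.6))]

print([name for name, _ in keep])  # ['free', 'cross-browser', 'npm']
```

Sorting is stable, so labels with equal votes keep their original relative order.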
The basic idea of our method is analysing software in a continuous low-dimensional space instead of working with the symbolic representation of triples. As shown in Equation 2, we propose a new embedding method that combines two embeddings of each entity, i.e., the structure-based embedding $E_s$ and the description-based embedding $E_d$:

$$E = E_s \oplus E_d \quad (2)$$

where $\oplus$ denotes the combination of the two kinds of vectors.
Thanks to the classical triple structure, the knowledge graph can efficiently provide graph structure information. Following the Skip-gram model, node2vec learns node representations from random walks over the graph.
However, node2vec does not distinguish the categories of different nodes in the graph; we avoid this problem by embedding the "category" nodes in the vector space as well.
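As a sketch of this stage, the toy example below (graph, walk length and node names are illustrative) generates node2vec-style uniform random walks over a graph that includes "category" and label nodes, so walk contexts mix software entities with category context; the resulting walk "sentences" would then be fed to a Skip-gram model to learn the embeddings:

```python
import random

# Toy graph mixing software entities with a category node and a label
# node, so walks carry category context into the embedding.
graph = {
    "NodeJS": ["Web Servers", "high-performance"],
    "Nginx": ["Web Servers", "high-performance"],
    "Web Servers": ["NodeJS", "Nginx"],
    "high-performance": ["NodeJS", "Nginx"],
}

def random_walk(start, length, rng):
    # Uniform random walk; real node2vec biases steps with p/q params.
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(0)
walks = [random_walk(node, 5, rng) for node in graph for _ in range(10)]
print(len(walks))  # 40 walk "sentences" of length 5
```

Note this uses uniform transitions for brevity; node2vec proper interpolates between BFS-like and DFS-like exploration via its return and in-out parameters.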
Fig. 3 The keywords of a short description of an entity. The different distances between keywords and the entity name result in different weights for each keyword.
For each software entity, there is a short description that reflects the features and functions of the entity. For example, Fig. 4 shows the description of NodeJS. The description encoder is built on the hypothesis that the keywords in a description are able to summarise the main features of an entity. The embedding of each keyword is calculated by word2vec.
Fig. 4 The description of NodeJS.
To capture this feature, we take the embedding of the entity name as an anchor point: words closer to the entity name receive more weight. The description embedding of an entity is calculated by the following equations:
Let $K(e) = \{k_1, \ldots, k_n\}$ be the keywords extracted from the description of entity $e$, and let $v(k_i)$ be the word2vec embedding of keyword $k_i$. Each keyword is assigned a weight

$$w_i = f(d_i) \quad (3)$$

and the description embedding of the entity is the weighted average

$$E_d(e) = \frac{\sum_{i=1}^{n} w_i\, v(k_i)}{\sum_{i=1}^{n} w_i} \quad (4)$$

where $d_i$ is the distance between keyword $k_i$ and the entity name in the description, and $f$ is the weighting function.
Inspired by prior work on distance-based weighting, we take a Gaussian function as the weighting function:

$$f(d_i) = \exp\!\left(-\frac{d_i^2}{2\sigma^2}\right) \quad (5)$$

where $\sigma$ controls how quickly the weight decays as a keyword moves away from the entity name; Fig. 5 plots this function.
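A minimal sketch of the Gaussian-weighted description embedding; the word vectors, distances and σ below are made-up illustrative values, and real inputs would be word2vec results:

```python
import math

# Hypothetical 3-D word vectors for keywords of a description.
word_vecs = {
    "platform": [0.2, 0.1, 0.0],
    "scalable": [0.5, 0.3, 0.1],
    "network":  [0.1, 0.4, 0.2],
}

def gaussian_weight(dist, sigma=2.0):
    # Keywords closer to the entity-name anchor get larger weights.
    return math.exp(-dist ** 2 / (2 * sigma ** 2))

def description_embedding(keywords_with_dist):
    # Weighted average of keyword vectors, weights Gaussian in the
    # distance from each keyword to the entity name.
    total_w, acc = 0.0, [0.0, 0.0, 0.0]
    for word, dist in keywords_with_dist:
        w = gaussian_weight(dist)
        total_w += w
        acc = [a + w * v for a, v in zip(acc, word_vecs[word])]
    return [a / total_w for a in acc]

emb = description_embedding([("platform", 1), ("scalable", 2), ("network", 4)])
print(emb)
```

Because the weights are normalised, the result stays inside the convex hull of the keyword vectors regardless of σ.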
Fig. 5 The Gaussian weighting function.
The requirement-based walk is proposed to find out which categories of software are needed by the architecture. The basic idea is that popular categories are what developers need. For example, the "Web Servers" category is used by 3442 companies in our data, while the "Graphic Design" category is used by 31 companies. Therefore, "Web Servers" is considered a category in the stack, while "Graphic Design" may not be.
Another key is to reflect the requirements of developers. As mentioned in the Graph Construction section, there is a software label layer in the knowledge graph, which consists of function and performance labels; in this paper, those labels are treated as software requirements. However, there are too many labels in the graph, i.e., 7800 labels, and it is meaningless to integrate all of them. We therefore set a threshold to filter out unimportant labels according to their weights, where the weight of a label is the number of people who agree with it on the stackshare website. Software tools that satisfy the labels are selected as the preliminary software stack. However, the number of tools in the preliminary data is too large, so it is further reduced by the result filtering process. Directly selecting popular software, or combinations of popular software, is not necessarily better; popularity is only one important factor. It is also necessary to consider the relations between software and companies, and among software items, which the knowledge graph is designed to capture.
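The label-threshold step of the requirement-based walk can be sketched as follows; the label weights, tools and threshold are hypothetical illustrations, not the paper's actual data:

```python
# Hypothetical label weights (developer agreement counts) and the
# labels attached to each software tool in the graph.
label_weight = {"high-performance": 1400, "free": 1500, "obscure": 12}
tool_labels = {
    "NodeJS": {"high-performance", "free"},
    "Nginx": {"high-performance"},
    "ToolX": {"obscure"},
}

THRESHOLD = 100  # drop weakly supported labels

def requirement_walk(required):
    # Keep only requirement labels whose weight passes the threshold,
    # then select tools whose label set covers all remaining labels.
    req = {l for l in required if label_weight.get(l, 0) >= THRESHOLD}
    return sorted(t for t, labels in tool_labels.items() if req <= labels)

print(requirement_walk(["high-performance", "obscure"]))
# ['Nginx', 'NodeJS'] -- "obscure" is below the threshold and ignored
```

The surviving tools form the preliminary stack that the SVM-based filtering stage then prunes.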
The results of the requirement-based walk process are selected by requirement labels, but they can be further filtered according to the relevance among software, companies, labels, etc. Following the idea of the embedding method, the vectors of popular software and rarely used software will be located in different areas of the vector space, and software items that appear together in company stacks will lie closer still. Therefore, considering the small amount of data and to improve generalisation performance, we implement an SVM classifier to find the boundary between good stacks and useless stacks. An SVM does not rely on the whole dataset; what matters is finding the support vectors.
The classifier finds a hyperplane W that separates two kinds of "points" with maximum margin. In this paper, the "points" are software stacks, classified as valuable stacks and worthless stacks. Since a software stack consists of multiple entities, each represented as a vector, we simply use the average vector of all entities in the stack to train the SVM classifier. The loss function is:
$$\min_{W,\,b,\,\xi}\ \frac{1}{2}\|W\|^2 + C\sum_{i=1}^{m}\xi_i \quad (6)$$

$$\text{s.t.}\quad y_i\,(W \cdot s_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0 \quad (7)$$

The $s_i$ is the average vector of the $i$-th software stack, $y_i \in \{+1, -1\}$ is its label (valuable or worthless), $\xi_i$ is the slack variable, and $C$ is the penalty coefficient.
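A small sketch of this training setup using scikit-learn's SVC (the stack vectors below are made-up 2-D toys; the real method uses averages of the learned high-dimensional embeddings):

```python
import numpy as np
from sklearn.svm import SVC

def stack_vector(tool_vecs):
    # A stack is represented as the average of its tool embeddings.
    return np.mean(tool_vecs, axis=0)

# Positive samples mimic company stacks; negatives mimic the two
# predefined rules (a single tool, a stack of unpopular tools).
positive = [stack_vector([[1.0, 1.2], [0.9, 1.1]]),
            stack_vector([[1.1, 0.9], [1.0, 1.0]])]
negative = [stack_vector([[-1.0, -0.8]]),                 # single tool
            stack_vector([[-0.9, -1.1], [-1.2, -1.0]])]   # unpopular mix

X = np.array(positive + negative)
y = np.array([1, 1, 0, 0])

clf = SVC(kernel="linear").fit(X, y)  # max-margin hyperplane W
pred = clf.predict([stack_vector([[0.95, 1.05], [1.05, 0.95]])])
print(pred)  # [1] -- classified as a valuable stack
```

A linear kernel matches the hyperplane formulation above; the penalty coefficient C is SVC's `C` parameter (default 1.0).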
The AI-CTO method is evaluated with real data from a famous technology exchange community, stackshare. The evaluation is organised around two research questions. RQ1: Does AI-CTO find effective results? RQ2: Are the AI-CTO solutions used by real users?
RQ 1 and RQ 2 examine the effectiveness of AI-CTO.
We evaluate our method on real-world data from stackshare, which records how famous companies build software systems. For example, 35 tools used by Facebook are recorded; Table 2 lists 10 of them. As the items in the raw data are discrete, the data is converted into a software knowledge graph containing 11876 entities and 43269 relations. The statistics of the knowledge graph are listed in Table 1; each item corresponds to one of the five layers of the graph.
A graph database is used to facilitate query and storage in this experiment; it was chosen from among the top five databases of the DB-Engines Ranking.
Table 1 The statistics of the software knowledge graph
| System | Stack | Category | Label | Framework | Company |
|--------|-------|----------|-------|-----------|---------|
| 1 | 4 | 546 | 7800 | 3175 | 350 |
Table 2 Technology Stack of Facebook
| Software tool | Category |
|---|---|
| PHP | Languages |
| React | Javascript UI Libraries |
| GraphQL | Query Languages |
| Memcached | Databases |
| Cassandra | Databases |
| Flux | Javascript UI Libraries |
| Tornado | Frameworks (Full Stack) |
| HHVM | Virtual Machine |
| Relay | Javascript UI Libraries |
| Yoga | Javascript UI Libraries |
To analyse the embedding results intuitively, the principal component analysis (PCA) algorithm, invented by Karl Pearson, is used to project the vectors to 2-D so that the embedding can be visualised. Fig. 6 shows the 2-D projections of the 128-D embeddings of the "python" node and related nodes; the pink star is the "python" node.
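A minimal sketch of this projection step, using random stand-in vectors in place of the learned 128-D embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the 128-D embeddings of "python" and related nodes.
embeddings = rng.normal(size=(20, 128))

# Project onto the two directions of maximum variance for plotting.
proj = PCA(n_components=2).fit_transform(embeddings)
print(proj.shape)  # (20, 2)
```

The two principal components preserve as much variance as any 2-D linear projection can, which is why nearby points in the plot tend to correspond to related nodes.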
The baselines are implemented to compare with AI-CTO from two points of view. One is the feature used in AI-CTO, i.e., the graph structure feature and the description feature. Another is the method used to form a stack. It is clear from Fig. 6 that nodes with high correlation will be closer to each other. Therefore, the basic idea for baselines to form a stack is calculating the distance between software in a stack.
Fig. 6 The embedding results of "python" and related nodes, projected to 2-D by the PCA algorithm. The pink star is the "python" node.
The baseline one extracts graph structure information to represent nodes in the software knowledge graph. The basic idea of this model is that the closer the two nodes are, the more relevant they are. Relevant nodes are considered as good combination for software development.
The baseline one model takes embedding results of node2vec as representations of nodes in the graph. The requirement based walk process selects all possible software stacks from the graph to give a preliminary software stack. However, there are too many groups in the preliminary data. The model calculates the Euclidean distances between node vectors to filter irrelevant results out. Fig. 7 illustrates the architecture of baseline one model.
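The distance-based filtering of baseline one can be sketched as scoring each candidate stack by the summed pairwise Euclidean distances of its tool vectors; the vectors and tool names below are illustrative, not actual node2vec output:

```python
import math

# Hypothetical node2vec vectors for candidate tools.
vecs = {"NodeJS": (0.9, 1.1), "MongoDB": (1.0, 0.9), "ToolX": (-2.0, 3.0)}

def stack_score(stack):
    # Sum of pairwise Euclidean distances: a smaller score means the
    # tools sit closer together in embedding space (more relevant).
    total = 0.0
    tools = list(stack)
    for i in range(len(tools)):
        for j in range(i + 1, len(tools)):
            total += math.dist(vecs[tools[i]], vecs[tools[j]])
    return total

ranked = sorted([["NodeJS", "MongoDB"], ["NodeJS", "ToolX"]], key=stack_score)
print(ranked[0])  # ['NodeJS', 'MongoDB'] -- the tighter combination wins
```

Baselines two and three use the same scoring but swap in word2vec and combined vectors respectively.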
Fig. 7 Baseline one: node2vec results and distance calculation.
The baseline two model extracts semantic information to represent nodes in the software knowledge graph. The training data consist of Wikipedia texts and description texts from stackshare. The basic idea of this model is the same as that of baseline one, but the node representations differ. This model also uses the Euclidean distance as the metric to filter out irrelevant results. Fig. 8 illustrates the architecture of the baseline two model.
Fig. 8 Baseline two: word2vec results and distance calculation.
Table 3 The categories used in the evaluation for AI-CTO
| Category | Company num | Label | Label weight |
|---|---|---|---|
| Languages | 981 | Can be used on frontend/backend | 1600 |
| Databases | 501 | Document-oriented storage | 789 |
| Javascript UI Libraries | 355 | Cross-browser | 1300 |
| Javascript MVC Frameworks | 203 | Quick to develop | 883 |
| In-Memory Databases | 222 | Performance | 843 |
| Frameworks (Full Stack) | 479 | Npm | 1300 |
| Web Servers | 312 | High-performance http server | 1400 |
| Microframeworks (Backend) | 106 | Simple | 322 |
| General Analytics | 254 | Free | 1500 |
The baseline three combines both graph structure information and semantic information to represent nodes. In other words, the baseline three model is the combination of baseline one and baseline two. This model also uses the same method to filter irrelevant results out. Fig. 9 illustrates the architecture of the baseline three model.
Fig. 9 Baseline three: graph structure + description text results and distance calculation.
Our method combines both graph structure information and semantic information to represent nodes, which is the same as that in baseline three. However, while baseline three just takes Euclidean distance as a metric to filter irrelevant results out, our method implements an SVM model to predict whether a software stack is valuable or not.
Training: The SVM model is trained with positive and negative samples. The positive samples consist of the software stack data of 350 famous companies; each company's stack is one positive sample. The negative samples are generated with two predefined rules:
- Single software. As a stack is a set of software, a single software item cannot be a positive sample.
- Stacks with unpopular software. Developers normally tend to use popular software, so stacks containing unpopular software are treated as negative samples.
In addition, each software item in a stack is represented as a vector, so the stack is represented as the average of all its software vectors.
To build a software stack, it is important to select appropriate categories. AI-CTO solves this problem from two aspects. One is the number of companies using a category, which reflects its practicality. The other is the labels related to the category, which reflect its ability to meet user demand. Table 3 shows the categories used in the evaluation of AI-CTO. As the number of labels is too large, the table only records the most weighted label per category. The label describes the software in the category: a heavier weight means more users pay attention to that label, i.e., stronger user demand.
Motivation: To answer RQ1 (Does AI-CTO find effective results?), AI-CTO is evaluated with real-world data from stackshare. AI-CTO uses both graph structure and description text features and predicts valuable software tools with an SVM model, which distinguishes it from baselines one and two. Baseline three also integrates graph structure and description text features, but it derives valuable software tools by calculating Euclidean distances. In this experiment, we investigate whether AI-CTO can outperform the baseline methods.
Metric: We use the popular metric Hits@k in this experiment. For example, Hits@10 is the proportion of correct stack solutions ranked in the top 10.
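Under one common reading of this metric (the fraction of the top-k entries that are correct), a small illustrative implementation; the stack identifiers are hypothetical:

```python
def hits_at_k(ranked_solutions, correct, k):
    # Proportion of the top-k ranked stacks that are correct.
    top_k = ranked_solutions[:k]
    return sum(1 for s in top_k if s in correct) / k

ranked = ["s1", "s2", "s3", "s4", "s5"]
correct = {"s1", "s3", "s9"}
print(hits_at_k(ranked, correct, 5))  # 2 correct of top 5 -> 0.4
```

Note that "s9" is correct but unranked, so it cannot contribute to any Hits@k score.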
Results: We test the correctness of the three baseline methods and AI-CTO. As the models obtain a great number of ranking results from the selection space, we choose 20, 50 and 100 as the values of k. According to the results reported in Table 4, AI-CTO performs better than all baseline methods in Hits@20, Hits@50 and Hits@100. Fig. 10 shows the results for different numbers of categories. node2vec performs better than word2vec because the graph structure reflects the distance feature better than the text feature does. However, the description text features are still helpful for distinguishing software entities.
Motivation: This experiment is performed to answer RQ2 (Are the AI-CTO solutions used by real users?). We would like to investigate whether the solutions of AI-CTO are used by companies.
Metric: We use the number of users as the metric in this experiment. top(x) denotes the number of companies that use the top x stacks.
Results: According to the results reported in Table 5, AI-CTO performs better than all baseline methods in top(20), top(50) and top(100). The solutions derived by AI-CTO are used by real users.
Table 4 The results of the correctness evaluation (8 categories used)

| Metric | Hits@20 | Hits@50 | Hits@100 |
|---|---|---|---|
| node2vec | 0.85 | 0.80 | 0.64 |
| word2vec | 0.00 | 0.00 | 0.07 |
| node2vec+word2vec | 0.55 | 0.38 | 0.44 |
| AI-CTO | | | |
Fig. 10 The Hits results for different numbers of categories in the stack.
Table 5 The results of the evaluation by number of users (8 categories used)

| Metric | top(20) | top(50) | top(100) |
|---|---|---|---|
| node2vec | 45 | 92 | 137 |
| word2vec | 0 | 0 | 7 |
| node2vec+word2vec | 11 | 22 | 51 |
| AI-CTO | | | |
AI-CTO has limitations. This section discusses the threats to validity and why AI-CTO is still effective.
The software stack is built from the software categories required by developers. For example, a website application may need three kinds of tool: a front-end framework, a web server and a database. The stacking process selects one tool for each of the three kinds. Therefore, an important task is determining which categories are needed. Based on the hypothesis that the more companies use a category, the more important it is, AI-CTO chooses categories depending on the number of companies using them. Thus, the performance of our method depends on the quality of the technology stack data. In the future, we will try to analyse the characteristics of the software categories themselves and compare them with the results of AI-CTO.
As the dataset used in this paper is static, the performance of AI-CTO may be affected when new data are generated. However, companies do not adjust their technology stacks frequently, and as the idea of AI-CTO is learning from experience, the "old" data are sufficient to verify its feasibility. In addition, it is hard to verify the completeness of the data from stackshare; we cannot check whether the technology stack a company publishes there is complete.
Usually, a knowledge graph consists of various entities and relations, and different relation types can represent more information in a graph. The software knowledge graph in this paper contains only one kind of directed, weighted relation, and the relations carry no further attributes. However, as a node is represented by its neighbourhood, a single type of relation is sufficient to capture the information needed in the embedding process.
Software architecture is the blueprint of a system. The architecture reflects the constrained relationships among software components, and those constraints most often come from the system requirements.
In addition, it is difficult to clearly distinguish the boundary between the architecture layer and the design layer.
Graphs exist widely in the real world.
Early technologies are built on the symbolic representation of graphs, i.e., triples. However, due to the complex structure of graphs, these technologies are computationally inefficient when dealing with large-scale graphs.
There are also graph embedding technologies using predefined graph structures as features. TransE, for example, embeds entities and relations by treating a relation as a translation from the head entity to the tail entity.
The input data of graph embedding is diverse, such as the whole graph, nodes, and edges. There can be certain auxiliary information for nodes and edges, such as text descriptions, attributes, labels, and etc, which can be used to enhance the performance of graph embedding. The challenge is how to incorporate the auxiliary information in graph embedding model and how to combine it with graph structure information.
Xie et al. proposed representation learning of knowledge graphs with entity descriptions, encoding the description text to enhance the entity embeddings.
Wang and Li proposed a text-enhanced knowledge embedding method that incorporates textual context into the representations of entities and relations.
This paper proposes AI-CTO, a novel method to automatically suggest software stack solutions. The basic idea of AI-CTO is converting the development experience of famous companies into knowledge, i.e., a software knowledge graph, and deriving software stack solutions from that knowledge. To this end, we embed the software knowledge graph into a low-dimensional vector space, and we combine embeddings of software descriptions to make the graph embedding more precise. The evaluation of AI-CTO uses two research questions to analyse its effectiveness. The results show that AI-CTO can suggest effective solutions and outperforms the baselines. We will explore further research directions in future work.
This work is supported by National Key R&D Program of China (No. 2018YFB0803600) and National Natural Science Foundation of China (No. 61772507).
By Xiaoyun Xu; Jingzheng Wu; Mutian Yang; Tianyue Luo; Qianru Meng; Weiheng Li and Yanjun Wu