Learning Ticket Similarity with Context-sensitive Deep Neural Networks

Durga Prasad Muni, Suman Roy, Yeung Tack Yan John John Lew Chiang, Navin Budhiraja
Infosys Limited
#44 Electronic City, Hosur Road
Bangalore, India 560100
{DurgaPrasad Muni,Suman Roy,Yeung Chiang,Navin.Budhiraja}@infosys.com

Iheb Ben Abdallah
Computer Science and Electrical Engineering,
Ecole CentraleSupelec
Grande Voie des Vignes, Châtenay-Malabry
Paris, France 92290
Iheb.Benabdallah@supelec.fr
ABSTRACT
In Information Technology Infrastructure Library (ITIL) services, a sizable volume of tickets is raised
every day for different issues to be resolved so that the service can be delivered without interruption.
An issue is captured as a summary on the ticket, and once a ticket is resolved, the solution is also noted
down on the ticket as a resolution. It is required to automatically extract information from the
description of tickets to improve operations like identifying critical and frequent issues, grouping of
tickets based on textual content, suggesting remedial measures for them, etc. In an earlier work we
proposed a deep learning based recommendation algorithm for recovering resolutions for incoming tickets
through identification of similar tickets. In this work we use a similar deep neural network based
framework to compute the similarity between two tickets by considering context information. In
particular, we append the feature representation of tickets with context information to be fed as input
to a deep neural network. Our learning algorithm appears to improve on the performance of similarity
learning with the traditional techniques. In particular, the context-enriched DNN approach on average
improves the performance by 5-6% in comparison to the simple DNN-based approach.
CCS CONCEPTS
•Computing methodologies → Neural networks; •Applied Computing → Document management and text processing;
KEYWORDS
Deep Learning; Neural Network; Deep Neural Network; Ticket
similarity; Short Text Similarity; Context Information
ACM Reference format:
Durga Prasad Muni, Suman Roy, Yeung Tack Yan John John Lew Chiang,
Navin Budhiraja and Iheb Ben Abdallah. 2016. Learning Ticket Similarity
with Context-sensitive Deep Neural Networks. In Proceedings of ACM
Conference, Washington, DC, USA, July 2017 (Conference’17), 9 pages.
DOI: 10.475/123 4
This work was done when Iheb Ben Abdallah was an intern at Infosys Ltd during
July-Aug, 2017.
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).
Conference’17, Washington, DC, USA
© 2016 Copyright held by the owner/author(s). 123456724567/08/06. . . $15.00
DOI: 10.475/123 4
1 INTRODUCTION
A ticketing system forms a core component of problem and configuration management for Information
Technology Infrastructure Library (ITIL) services. A vast number of tickets are raised on the ticketing
system by users with a view to resolving issues/concerns they face while using different support systems.
These incident data in the form of tickets can be used for different purposes such as SLA calculation,
forecasting, optimum resource level checking, performance metrics computation, etc. A ticketing system
tries to minimize the business impact of incidents by addressing the concerns of the raised tickets. The
incident tickets record symptom descriptions of issues, as well as details on the incident resolution,
using a range of structured fields such as date, resolver, categories, affected servers and services, and
a couple of free-form entries outlining the description/summary of issues, notes by users/administrators,
etc. Once a ticket is resolved, the solution is also noted down on the ticket as resolution text. Manual
screening of such a huge volume of tickets would be laborious and time-consuming. One needs to extract
information automatically from the description of tickets to gain insights in order to improve operations
like identifying critical and frequent issues, grouping of tickets based on textual content, suggesting
remedial measures for them and so forth.
Deep learning allows computational models that are composed of multiple processing layers to learn
representations of data with multiple levels of abstraction [16]. These models typically use artificial
neural networks with several layers, called deep neural networks. Deep neural networks (DNN) are becoming
popular these days for providing efficient solutions for many problems related to language and
information retrieval [4, 6, 7]. In this work we use a context-sensitive Feed Forward deep neural network
(FFDNN) for computing ticket similarity. In an earlier work we proposed an automated method based on deep
neural networks for recommending resolutions for incoming tickets through identification of similar
tickets. We use ideas from deep structured semantic models (DSSM) for web search for such resolution
recovery. We take feature vectors of tickets and pass them on to a DNN to generate low-dimensional
feature vectors, which help compute the similarity of an existing ticket with the new ticket. We select a
couple of tickets which have the maximum similarity with the incoming ticket and publish their
resolutions as the suggested resolutions for the latter ticket.
We modify this framework of similarity computing by taking into account context information. Context may
appear in various forms ranging from neighboring sentences of a current sentence in a document [19] to
topics hidden within the sentence [15] and to the document containing the sentence in question [12]. As
neural networks are normally trained with local information, it makes sense to integrate context into
them. Global information which may be embedded in this context information can often be instrumental in
guiding neural networks to generate more accurate representations. In this task we consider topics
associated with tickets as the context information and feed them as vectors to the deep networks along
with the feature vectors for tickets. The neural network outputs a low-dimensional vector for each of the
input vectors. These two low-dimensional vectors for a ticket are combined to compute the similarity
between two tickets. A schematic diagram of our method is shown in Figure 2.
We apply this context-driven similarity technique to three semantic similarity tasks: contextual ticket
similarity with respect to a tuple, in which we aim to predict the similarity between a given pair of
tickets in a tuple; ticket ranking, in which we aim to retrieve semantically equivalent tickets with
respect to a given test ticket; and resolution recommendation, in which we aim to suggest resolutions for
a given ticket [22]. We carry out extensive experimentation on these tasks. Our technique shows an
appreciable improvement of 5-6% over the non-context-based DNN approach for most of these semantic
similarity tasks. The contributions of our work lie in proposing an approach for injecting context into
deep neural networks and in showing that feeding such additional information to deep neural networks
improves their representations for various similarity tasks.
1.1 Related Work
Neural networks are effectively used to compute semantic similarity between documents [8, 23, 26]. Deep
learning has also been used to find similarity between two short texts. Hu et al. [11] have used
convolutional neural networks for matching two sentences. The approach could nicely represent the
hierarchical structures of sentences with their layer-by-layer composition and pooling and thus capture
the rich matching patterns at different levels. Lu and Li [20] have proposed a deep architecture that can
find a match between two objects from heterogeneous domains. In particular, they apply their model to
match short texts for tasks such as finding relevant answers to a given question and finding sensible
responses to a tweet. In [27], a convolutional neural network has been used for ranking pairs of short
texts, wherein the optimal representation of text pairs and a similarity function to relate them are
learnt in a supervised manner. Long Short-Term Memory (LSTM) has also been used to find similarity
between sentences in [3, 21].
The idea of using FFDNN in our work originates from work on learning deep structured latent models for
web search [9, 10, 13]. Motivated by these ideas, we have used deep neural network models to recommend
resolutions for tickets in ITIL services [22]. We project existing tickets and an incoming ticket to a
common low-dimensional space and then compute the similarity of the new ticket with other tickets. We
select the ticket which has the highest similarity with the new ticket and pick up the resolution of the
former as the recommended resolution for the new ticket. In this work, we integrate context into deep
neural networks for computing similarity between two tickets (the motivation came from [2]), which is a
novelty of our work. In addition to the feature vector, a context vector for a ticket is also injected
into the deep neural network to obtain a combination of two low-dimensional vectors, which are then used
to compute the similarity of a pair of tickets. Recently Amiri et al. have used an extended framework of
deep autoencoders with context information to learn text pair similarity in [2]. There the authors use
context information as low-dimensional vectors which are injected into deep autoencoders along with text
feature vectors for similarity computation. While the authors use autoencoders for finding similarity of
texts, in this work we use a feed forward deep neural network as it provides a more suitable framework to
compute the similarity between tickets.
Figure 1: Snapshot of relevant parts of incident ticket data
Organization of the paper. The paper is organized as follows. We describe the ticket schema that we
consider in Section 2. The Feed forward Deep Neural Network (FFDNN) used for text similarity is
introduced in Section 3. The context-sensitive Feed forward Deep Neural Network is introduced in
Section 4. We describe our approach to compute similarity between two tickets using the context-sensitive
FFDNN in Section 5. Experimental results are discussed in Section 6. Finally we conclude in Section 7.
2 TICKET DATA SET
We consider incident tickets with similar schema, which are frequent in ITIL. These tickets usually
consist of two kinds of fields, fixed and free-form. Fixed fields are customized and filled in a
menu-driven fashion. Examples of such items are the ticket’s identifier, the time the ticket is raised or
closed on the system, or whether a ticket is an incident or a request in nature. Various other
information is captured through these fixed fields, such as the category of a ticket, the employee number
of the user raising the ticket, etc., and also Quality of Service parameters like response time,
resolution time of the ticket, etc. There is no standard value for free-form fields. The concern/issue
for raising a ticket is captured as a “call description” or “summary” in free-form text; it can be just a
sentence that summarizes the problem reported in it, or it may contain a detailed description of the
incident. By using this freely generated part of tickets, administrators can get to know about unforeseen
network incidents and can also obtain a much richer classification. A small note is recorded as the
resolution taken for each ticket. A small part of the ticket data is shown in Figure 1.
2.1 Feature vector creation from Ticket Data
We assume a free field of a ticket to contain a succinct problem description associated with it in the
form of a summary (or call description). We extract features from the collection of summaries of tickets
using lightweight natural language processing. As a preprocessing step we remove the tickets which do not
contain a summary/resolution. In the beginning we lemmatize the words in the summary of tickets. Then we
use the Stanford NLP tool to parse the useful contents in the summary of the tickets and tag them as
tokens. Next we set up some rules for removing tokens which are stop words. We compute the document
frequency (DF)¹ of each lemmatized word. We discard any word whose DF is smaller than 3. A ticket summary
may contain some rare words, like the name of a person (the user who raised the ticket), and some noise
words. By discarding words with DF less than 3, we can remove these rare words, which do not contribute
to the content of the ticket summary. In this way, the feature vector size can be reduced significantly.
We perform some other preprocessing and select bigrams and trigrams as terms (keyphrases), the details of
which are described in [24]. Finally, a profile of a ticket is given as $T = (x_1, \ldots, x_n) = \vec{x}$,
where $x_1, \ldots, x_n$ are the appropriate weights for the chosen words $w_1, \ldots, w_n$ respectively
from the summary of $T$. We shall use the TF*IDF [25] of a word as its weight².
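The DF filtering and TF*IDF weighting described above can be sketched as follows. This is a minimal illustration in Python, not the authors' pipeline: it assumes the summaries are already lemmatized and stripped of stop words, and it uses one common TF*IDF variant, tf × log(N / df), since the text does not state which variant is used. The function name `tfidf_profiles`, the parameter `min_df` and the toy summaries are hypothetical.

```python
import math
from collections import Counter


def tfidf_profiles(token_lists, min_df=3):
    """Build TF*IDF profiles for tokenized ticket summaries.

    Words whose document frequency (DF) is below min_df are discarded,
    mirroring the "DF smaller than 3" rule in the text. The weighting
    tf * log(N / df) is an assumed variant, not confirmed by the paper.
    """
    n_docs = len(token_lists)
    # DF: number of summaries that contain the word at least once.
    df = Counter(word for tokens in token_lists for word in set(tokens))
    vocab = sorted(word for word, count in df.items() if count >= min_df)
    index = {word: j for j, word in enumerate(vocab)}

    profiles = []
    for tokens in token_lists:
        tf = Counter(t for t in tokens if t in index)
        vec = [0.0] * len(vocab)
        for word, count in tf.items():
            vec[index[word]] = count * math.log(n_docs / df[word])
        profiles.append(vec)
    return vocab, profiles


# Toy, made-up summaries: "john", "offline", "queue", ... occur once and are dropped.
summaries = [
    ["printer", "error", "john"],
    ["printer", "offline"],
    ["printer", "error", "queue"],
    ["login", "error", "password"],
]
vocab, profiles = tfidf_profiles(summaries, min_df=3)
```

In this toy corpus only "error" and "printer" reach a DF of 3, so each profile has two entries; rare tokens such as a user name no longer contribute a dimension.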
2.2 Relational schema on fixed elements
The fixed field entries of a ticket can be represented using a relational schema. For that we shall
consider only a limited number of fixed fields of a ticket for choosing attributes that reflect its main
characteristics (the domain experts’ comments play an important role in choosing the fixed fields); for
example, the attributes can be application name, category and subcategory. They can be represented as a
tuple: Ticket(application name, category, subcategory). Each of the tuples corresponding to entries in
the fixed fields of a ticket can be thought of as an instantiation of the schema. Examples of rows of
such a schema are (AS400 - Legacy Manufacturing, Software, Application Errors), (AS400 Legacy - Retail,
Software, Application Functionality Issue), etc. The relation key can vary from 1 to the number of
distinct tuples in the schema. One such key can hold several Incident IDs, that is, it can contain
several tickets with different IDs.
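As an illustration of the relation key, the sketch below assigns keys 1, 2, ... to distinct fixed-field tuples and collects the incident IDs under each key. The field names, the incident IDs and the function `group_by_tuple` are hypothetical; only the two example tuples come from the text.

```python
from collections import defaultdict


def group_by_tuple(tickets, fields=("application", "category", "subcategory")):
    """Map each distinct fixed-field tuple to a relation key and its incident IDs."""
    key_of = {}                   # tuple of fixed-field values -> relation key
    members = defaultdict(list)   # relation key -> incident IDs holding that tuple
    for ticket in tickets:
        tup = tuple(ticket[f] for f in fields)
        # A new tuple receives the next unused key; a known tuple keeps its key.
        key = key_of.setdefault(tup, len(key_of) + 1)
        members[key].append(ticket["id"])
    return key_of, dict(members)


# Incident IDs are invented for the example.
tickets = [
    {"id": "INC001", "application": "AS400 - Legacy Manufacturing",
     "category": "Software", "subcategory": "Application Errors"},
    {"id": "INC002", "application": "AS400 Legacy - Retail",
     "category": "Software", "subcategory": "Application Functionality Issue"},
    {"id": "INC003", "application": "AS400 - Legacy Manufacturing",
     "category": "Software", "subcategory": "Application Errors"},
]
key_of, members = group_by_tuple(tickets)
```

Here the first and third tickets share a tuple and therefore a key, which is the sense in which one key can hold several Incident IDs.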
3 FEED FORWARD DEEP NEURAL NETWORK FOR TEXT SIMILARITY
Deep learning (DL) is based on algorithms that learn multiple levels
of representation and abstractions in data. It typically uses artificial
neural networks with several layers. These are called deep neural
networks.
3.1 Architecture of Feed forward Deep Neural Network
Feed forward Deep Neural Networks (FFDNNs) have multiple hidden layers. Each layer in an FFDNN adds its
own level of nonlinearity, which allows the network to solve more complex problems. The FFDNN used in
this paper to find similar texts processes feature vectors of texts and works in two stages. First, the
FFDNN maps the high-dimensional sparse text feature vector of a document (text) layer by layer into a
low-dimensional feature vector. In the next stage, the low-dimensional feature vectors are passed through
cosine similarity function nodes to compute the similarity between two texts.
The architecture of the FFDNN [13] that we use for our purpose here is shown in Fig 3. Given a text
(document), the objective of the FFDNN model is to find the text (document) from the existing texts
(documents) that is the most similar to the new one. To fulfill this goal, we can train the FFDNN with a
set of texts along with their similar and dissimilar texts.
Figure 2: A schematic diagram of our Approach
Figure 3: Architecture of Feed Forward Deep Neural Network for text similarity
¹ Document frequency of a word is the number of tickets (ticket summaries) containing the word in the data set (corpus) [17].
² TF*IDF is a popular metric in the data mining literature [17].
3.2 Structure of the FFDNN
We have used the FFDNN shown in Fig 3 to find similarity between the new ticket and a set of existing
tickets in [22]. Prior to computing similarity, the FFDNN reduces the high-dimensional feature vectors
representing summaries of tickets into low-dimensional vectors. To accomplish this it uses DNNR, a
multilayer feed-forward deep neural network that reduces dimension.
The structure of the DNNR is given in Fig 5. The input layer of DNNR consists of $n$ nodes, where $n$ is
the size of the feature vector of a document. Let there be $N - 1$ hidden layers. It has one output
layer, which is the $N$-th layer of the network (excluding the input layer). Let $\vec{x}$ be the input
feature vector and $y$ the output vector. Let $h_i$, $i = 1, 2, \ldots, N - 1$, be the $i$-th
intermediate hidden layer, $W_i$ the $i$-th weight matrix and $b_i$ the $i$-th bias vector. We then have
$$
\begin{aligned}
h_1 &= W_1 \vec{x} \\
h_i &= f(W_i h_{i-1} + b_i), \quad i = 2, 3, \ldots, N - 1 \\
y &= f(W_N h_{N-1} + b_N)
\end{aligned}
\tag{1}
$$
Figure 4: Structure of Feed Forward Deep Neural Network
during training
We consider $\tanh(z)$ as the activation function at the output layer and at the hidden layers. The
activation function is defined as
$$
f(z) = \tanh(z) = \frac{1 - e^{-2z}}{1 + e^{-2z}}
\tag{2}
$$
The output of DNNR is passed through a cosine similarity function as shown in Fig 3.
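The forward pass of Eqn 1 with the tanh activation of Eqn 2 can be sketched in a few lines. This is an illustrative pure-Python version, not the authors' Java/Neuroph implementation; the function names, the toy layer sizes 8-4-3-2 and the random initialization are assumptions made for the example. The first layer is kept exactly as written in Eqn 1, that is, without bias or activation.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def dnnr_forward(x, weights, biases):
    """Eqn 1: h_1 = W_1 x, then h_i = tanh(W_i h_{i-1} + b_i) up to the output y.

    weights holds W_1..W_N; biases holds b_2..b_N (Eqn 1 has no b_1 term).
    """
    h = matvec(weights[0], x)                      # first layer, linear as in Eqn 1
    for W, b in zip(weights[1:], biases):
        z = [a + c for a, c in zip(matvec(W, h), b)]
        h = [math.tanh(v) for v in z]              # Eqn 2 activation
    return h


# Toy network with sizes 8-4-3-2, so N = 3 layers excluding the input layer.
rng = random.Random(0)
sizes = [8, 4, 3, 2]
weights = [[[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes, sizes[1:])]
biases = [[0.0] * n_out for n_out in sizes[2:]]
x = [rng.random() for _ in range(sizes[0])]
y = dnnr_forward(x, weights, biases)               # low-dimensional output vector
```

Because the last layer applies tanh, every component of `y` lies strictly between -1 and 1, whatever the input dimension.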
4 CONTEXT-SENSITIVE FFDNN
We now extend FFDNN to incorporate context information about inputs. For each ticket $T$ in the training
set, represented by its feature vector $\vec{x} \in \mathbb{R}^n$, we generate a context vector
$\vec{c}_x \in \mathbb{R}^k$ containing contextual information about the input. The input and the target
task can determine the nature of the context vector.
We pass the context vector $\vec{c}_x$ through DNNRC, which reduces its dimension. Here DNNR and DNNRC
have the same structure, but they use different weight and bias parameters (indexed by $W$ and $V$
respectively), see Fig 5. For producing low-dimensional vectors through DNNRC we follow the same approach
as discussed for the basic FFDNN captured by Eqn 1. Let $l_i$, $i = 1, 2, \ldots, N - 1$, be the $i$-th
intermediate hidden layer representation of DNNRC, $V_i$ the $i$-th weight matrix and $d_i$ the $i$-th
bias vector. The FFDNN maps the input $\vec{c}_x$ to the context-sensitive representations
$l_2, l_3, \ldots, l_{N-1}$ at hidden layers $2, 3, \ldots, N - 1$, and to a context-sensitive output
$y_c$ given by Eqn 3.
$$
\begin{aligned}
l_1 &= V_1 \vec{c}_x \\
l_i &= f(V_i l_{i-1} + d_i), \quad i = 2, 3, \ldots, N - 1 \\
y_c &= f(V_N l_{N-1} + d_N)
\end{aligned}
\tag{3}
$$
Each ticket $T$ is now represented as a context-rich feature vector $(\vec{y}, \vec{y}_c)$ for further
analysis.
Figure 5: DNNR or DNNRC: Part of DNN that reduces the dimension
4.1 Training of the FFDNN
The structure of the context-based FFDNN during training is given in Fig 4. To train the (context-based)
FFDNN, we take a set of $M$ ticket summaries, $\{T_m : m = 1, 2, \ldots, M\}$. Each ticket summary $T_m$
is coupled with one similar ticket $T_{i_m}^{+}$ and three dissimilar tickets $T_{i'_m}^{-}$. These four
similar and dissimilar tickets are represented by a set $\mathcal{T}_m$.
The given ticket $T_m$ and each of the similar and dissimilar tickets are fed to the DNNR one by one. The
structure of the context-sensitive FFDNN that we use is shown in Fig 6. Let $(y_m, y_{c_m})$ be the
combined output feature vector for $T_m$ and $(y_i, y_{c_i})$ be the combined output feature vector for
$T_i$.
The cosine similarity $R(T_m, T_i)$ between the output vector $(y_m, y_{c_m})$ and another output vector
$(y_i, y_{c_i})$ is computed using Eqn 4 as below:
$$
R(T_m, T_i) = \cos(y_m, y_i) + \lambda \cos(y_{c_m}, y_{c_i})
= \frac{y_m^T y_i}{\|y_m\| \, \|y_i\|} + \lambda \, \frac{y_{c_m}^T y_{c_i}}{\|y_{c_m}\| \, \|y_{c_i}\|},
\quad \lambda \in [0, 1]
\tag{4}
$$
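Eqn 4 translates directly into code. The sketch below is a plain Python rendering under two assumptions that are not in the paper: a zero vector is given cosine 0 to avoid a division by zero, and the default λ = 0.3 is the value the experiments later report. The function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine of the angle between u and v; defined as 0.0 for a zero vector (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ticket_similarity(y_m, yc_m, y_i, yc_i, lam=0.3):
    """Eqn 4: R = cos(y_m, y_i) + lam * cos(yc_m, yc_i), with lam in [0, 1]."""
    return cosine(y_m, y_i) + lam * cosine(yc_m, yc_i)


# Identical ticket and context outputs: 1 + 0.3 * 1.
r_same = ticket_similarity([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0])
# Orthogonal ticket outputs but identical context outputs: 0 + 0.3 * 1.
r_context_only = ticket_similarity([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Since each cosine is bounded by 1 in absolute value, R is bounded by 1 + λ, which is why context-based scores can exceed 1.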
These R-values of similar and dissimilar tickets with respect to $T_m$ are supplied to the Softmax
function as shown in Fig 4. The Softmax function computes posterior probabilities [13]. The posterior
probability for $R(T_m, T_{i_m})$ is given in Eqn 5 below.
$$
P(T_{i_m} \mid T_m) = \frac{\exp(\gamma R(T_m, T_{i_m}))}{\sum_{T_{i'_m} \in \mathcal{T}_m} \exp(\gamma R(T_m, T_{i'_m}))},
\tag{5}
$$
where $\gamma$ is the smoothing parameter in the Softmax function and $\mathcal{T}_m$ is the set of
similar and dissimilar tickets coupled with $T_m$. As our objective is to find the most similar ticket
for a given ticket $T_m$, we maximize the posterior probability for the similar (or positive) tickets.
Equivalently, we minimize the following loss function
$$
L(\Omega) = -\log \prod_{(T_m, T_{i_m}^{+})} P(T_{i_m}^{+} \mid T_m)
\tag{6}
$$
where $\Omega$ denotes the set of parameters $\{W_i, b_i, V_i, d_i : i = 1, 2, \ldots, N\}$ of the neural
networks. $L(\Omega)$ is differentiable with respect to $\Omega$, as it is continuous and its (partial)
derivatives are also continuous (see Section A). So the FFDNN can be trained using a gradient-based
numerical optimization algorithm. The parameters in $\Omega$ are updated as
$$
\Omega_t = \Omega_{t-1} - \epsilon_t \left. \frac{\partial L(\Omega)}{\partial \Omega} \right|_{\Omega = \Omega_{t-1}},
\tag{7}
$$
where $\epsilon_t$ is the learning rate at the $t$-th iteration, and $\Omega_t$ and $\Omega_{t-1}$ are
the model parameters at the $t$-th and $(t - 1)$-th iteration, respectively. For other details see
[13, 22].
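The posterior of Eqn 5 and the loss of Eqn 6 can be sketched as follows. This is only an illustration: the value of γ is not given in the text, so `gamma=10.0` is an arbitrary placeholder, and the convention that the similar ticket comes first in each list of R-values is an assumption of this sketch. The log of the product in Eqn 6 is computed as a sum of logs.

```python
import math


def posteriors(r_values, gamma=10.0):
    """Eqn 5: softmax over gamma * R for the candidate tickets of one training ticket."""
    shift = max(r_values)                     # subtract the max for numerical stability
    exps = [math.exp(gamma * (r - shift)) for r in r_values]
    total = sum(exps)
    return [e / total for e in exps]


def loss(batches, gamma=10.0):
    """Eqn 6: -log of the product of positive posteriors, as a sum of -log terms.

    Each batch lists the R-values of one training ticket against its candidates,
    with the similar (positive) ticket at index 0 (assumed convention).
    """
    return -sum(math.log(posteriors(r, gamma)[0]) for r in batches)


uniform = posteriors([1.0, 1.0, 1.0, 1.0])      # no preference among four candidates
confident = posteriors([1.2, 0.1, 0.0, -0.2])   # positive ticket clearly ahead
uninformed_loss = loss([[1.0, 1.0, 1.0, 1.0]])  # equals log(4) for one such batch
```

With one similar and three dissimilar tickets, an untrained model that scores all four equally incurs a loss of log 4 per training ticket; the loss approaches 0 as the positive R-value pulls ahead.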
Figure 6: Structure of Context Sensitive Feed Forward Deep Neural Network for text similarity
4.2 Context Extraction
Context relates to the information content present in the summary of tickets. As context information is
not available with the ticket description, we resort to topic models [5, 29] to obtain contexts for each
individual ticket. Given the feature vector representations of tickets $T_1, \ldots, T_n$ as
$\vec{x}_1 = (x_{1,1}, \ldots, x_{1,m}), \ldots, \vec{x}_n = (x_{n,1}, \ldots, x_{n,m})$ respectively, we
can compute the ticket-term matrix $X$ as
$$
X = \begin{pmatrix}
x_{1,1} & \cdots & x_{1,m} \\
\vdots & \ddots & \vdots \\
x_{n,1} & \cdots & x_{n,m}
\end{pmatrix}
$$
The matrix $X$ is of dimension $n \times m$, where $n$ and $m$ are the numbers of tickets and terms
respectively. Recall that each element $x_{i,j}$ here denotes the TF*IDF value of term $j$ in ticket $i$.
Generally the rows of $X$ are normalized to have unit Euclidean length. Assuming the given ticket corpus
to have $k$ topics, the goal is to factorize $X$ into two non-negative matrices $C$ (of dimension
$n \times k$) and $D$ (of dimension $k \times m$). Here $C$ is the ticket-topic matrix in which each row
specifies a ticket as a non-negative combination of topics. It is desired that each topic be associated
with as few terms as possible, and hence $D$ can be assumed to be sparse. Subsequently, we consider the
following minimization problem:
$$
\operatorname*{argmin}_{C, D \ge 0} \; \frac{1}{2} \|X - CD\|_F^2 + \alpha \|D\|_1 + \beta \|C\|_F^2,
\tag{8}
$$
where $\|\cdot\|_F$ denotes the square root of the sum of squares of all the elements in the matrix (also
called the Frobenius norm), and $\|\cdot\|_1$ is the $L_1$ norm. The last two terms in Equation 8 are the
regularization terms, and $\alpha$ and $\beta$ are two regularization parameters which control the
strength of regularization. This sparse non-negative matrix factorization (NMF) formulation can be solved
using the Block Coordinate Descent (BCD) method by breaking the original problem into two subproblems,
the details of which can be found in [14].
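To make the three terms of Equation 8 concrete, the sketch below evaluates the objective for given matrices. It only computes the value being minimized; it does not implement the Block Coordinate Descent solver of [14]. The function name, the toy matrices and the values of α and β are assumptions chosen for the example.

```python
def nmf_objective(X, C, D, alpha, beta):
    """Value of Eqn 8: 0.5 * ||X - CD||_F^2 + alpha * ||D||_1 + beta * ||C||_F^2."""
    n, m, k = len(X), len(X[0]), len(D)
    residual = 0.0
    for i in range(n):
        for j in range(m):
            approx = sum(C[i][t] * D[t][j] for t in range(k))  # entry (i, j) of CD
            residual += (X[i][j] - approx) ** 2
    l1_D = sum(abs(v) for row in D for v in row)   # sparsity penalty on topic-term matrix
    fro_C = sum(v * v for row in C for v in row)   # squared Frobenius norm of ticket-topic matrix
    return 0.5 * residual + alpha * l1_D + beta * fro_C


# Toy factorization that reproduces X exactly, so only the penalties remain.
C = [[1.0, 0.0], [0.0, 1.0]]
D = [[1.0, 2.0], [3.0, 4.0]]
X = [[1.0, 2.0], [3.0, 4.0]]
exact = nmf_objective(X, C, D, alpha=0.1, beta=0.01)
# Changing one entry of X by 1 adds 0.5 * 1^2 through the reconstruction term.
perturbed = nmf_objective([[2.0, 2.0], [3.0, 4.0]], C, D, alpha=0.1, beta=0.01)
```

In the exact case the objective reduces to α·‖D‖₁ + β·‖C‖²_F = 0.1·10 + 0.01·2, which shows how the regularizers trade reconstruction quality against sparsity of D and the size of C.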
We adopt an approximation technique to obtain the context vector for test instances by using the fitted
NMF model. As the fitted model is $X = CD$, we can get $C = X D^T (D D^T)^{-1} = XF$, where
$F = D^T (D D^T)^{-1}$. For a new ticket $T_{new}$ we construct the feature vector as
$\vec{x}_{new} = (x_1^{new}, \ldots, x_m^{new})$. Then one can get the context vector
$\vec{c}_{new} = \vec{x}_{new} F$. If any entry of $\vec{c}_{new}$ is negative, it is set to zero.
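The expression for $F$ follows from right-multiplying the fitted model by $D^T$. The short derivation below spells this out together with the matrix dimensions; it assumes that $D$ has full row rank $k$, so that $D D^T$ is invertible, which the text leaves implicit. The componentwise $\max$ notation for the clipping step is ours.

```latex
\begin{aligned}
X &= C D
  && X \in \mathbb{R}^{n \times m},\; C \in \mathbb{R}^{n \times k},\; D \in \mathbb{R}^{k \times m} \\
X D^T &= C \, (D D^T)
  && D D^T \in \mathbb{R}^{k \times k} \\
C &= X D^T (D D^T)^{-1} = X F
  && F = D^T (D D^T)^{-1} \in \mathbb{R}^{m \times k} \\
\vec{c}_{new} &= \max\!\left(0,\; \vec{x}_{new} F\right)
  && \vec{x}_{new} \in \mathbb{R}^{1 \times m},\; \vec{c}_{new} \in \mathbb{R}^{1 \times k}
\end{aligned}
```

Under the full-row-rank assumption, $F$ is the right pseudo-inverse of $D$, so $D F = I_k$; the final clipping restores the non-negativity that the unconstrained projection does not guarantee.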
5 TICKET SIMILARITY USING CONTEXT SENSITIVE FFDNN
We adopt the following approach for computing similarity scores for a pair of input tickets endowed with
their corresponding context information. For a pair of tickets $T_1$ and $T_2$ with feature vector
representations $\vec{x}_1$ and $\vec{x}_2$, we obtain their context representations $\vec{c}_{x_1}$ and
$\vec{c}_{x_2}$ respectively using the method mentioned above. Given these context-enriched
representations of the two tickets, $(\vec{x}_1, \vec{c}_{x_1})$ and $(\vec{x}_2, \vec{c}_{x_2})$, we
pass them as inputs to the already constructed DNNs, DNNR and DNNRC respectively, as shown in Figure 3,
which produce outputs $(y_1, y_{c_1})$ and $(y_2, y_{c_2})$ respectively. Then we compute the similarity
between these two tickets using Eqn 4:
$$
\mathrm{Sim}_{DNN}(T_1, T_2) = R(T_1, T_2) = \cos(y_1, y_2) + \lambda \cos(y_{c_1}, y_{c_2})
= \frac{y_1^T y_2}{\|y_1\| \, \|y_2\|} + \lambda \, \frac{y_{c_1}^T y_{c_2}}{\|y_{c_1}\| \, \|y_{c_2}\|},
\quad \lambda \in [0, 1]
\tag{9}
$$
Table 1: Statistics of Ticket Data from Different Domain
Domain   Total Tickets   No. of tuples   Input feature vector dimension   Context vector dimension
SnA      5011            42              2044                             400
FnB      50666           82              1584                             320
Ret      14379           40              1052                             200
6 EXPERIMENTAL RESULTS
We discuss the experiments that we conduct on IT maintenance ticket data belonging to Infosys Ltd. We
implement these deep learning-based approaches for ticket similarity using Java. We use the
neuroph-core-2.92 library³ to implement the FFDNN.
6.1 Ticket data
We have used data sets from three domains in ITIL services to validate our methodology. These domains are
Sports and Apparel (SnA, in short), Food and Beverages (FnB, in short) and Retail (Ret, in short). The
data from the SnA domain portrays issues related to the sports and apparel industry. It has 5011 tickets.
As mentioned earlier, the ticket summaries are preprocessed and are represented by TF*IDF feature
vectors. FnB tickets contain information related to the food and beverages sector; this data set contains
50666 tickets. The Retail data captures information on issues related to services to customers through
multiple channels of distribution. This data set includes 14379 tickets. The details of these data sets
are given in Table 1.
6.2 Data Partition
We randomly pick 10% of each data set (tickets) as the test set and the remaining 90% as the training
set. Within the training set, we reserve 20% of the data set as a sample of $M$ tickets for training the
neural networks as described in Section 4.1; the remaining 70% are used to pick similar and dissimilar
tickets to compare with those $M$ tickets. That is, for each ticket $T_m$, $m = 1, 2, \ldots, M$, of the
training sample, we take one similar ticket and three dissimilar tickets from the remaining 70% of the
data set (with repetition).
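The 10/20/70 partition and the selection of one similar and three dissimilar tickets can be sketched as below. This is an illustration under stated assumptions, not the authors' procedure: the function names, the fixed seeds and the toy parity-based similarity predicate are hypothetical, and "with repetition" is read here as allowing a pool ticket to be drawn more than once.

```python
import random


def partition(ticket_ids, seed=0):
    """Split ticket ids into 10% test, 20% training sample (the M tickets), 70% pool."""
    rng = random.Random(seed)
    ids = list(ticket_ids)
    rng.shuffle(ids)
    n_test, n_sample = len(ids) // 10, len(ids) // 5
    test = ids[:n_test]
    sample = ids[n_test:n_test + n_sample]
    pool = ids[n_test + n_sample:]
    return test, sample, pool


def sample_candidates(ticket, pool, is_similar, rng):
    """Pick one similar and three dissimilar pool tickets for a training ticket.

    Returns None when the pool lacks either kind of ticket. Dissimilar tickets
    are drawn with replacement (our reading of "with repetition").
    """
    similar = [t for t in pool if is_similar(ticket, t)]
    dissimilar = [t for t in pool if not is_similar(ticket, t)]
    if not similar or not dissimilar:
        return None
    return rng.choice(similar), rng.choices(dissimilar, k=3)


# Toy data: integer ids, "similar" meaning same parity (purely illustrative).
test, sample, pool = partition(range(100))
same_parity = lambda a, b: a % 2 == b % 2
positive, negatives = sample_candidates(sample[0], pool, same_parity, random.Random(1))
```

In practice the predicate would be the tuple-and-cosine rule of Section 6.3 rather than the toy parity check used here.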
³ can be downloaded from http://neuroph.sourceforge.net/download.html
6.3 Semantic similarity tasks
We apply this ticket similarity framework to three semantic similarity tasks. We also validate the
results on these tasks.
Contextual ticket similarity. We adopt a semi-supervised approach for predicting the similarity score
between two tickets enriched with context information. These scores can be used to compute the semantic
similarity of a group of tickets in the same tuple. Recall that using the DNN-centric method described in
Section 5 we can compute the similarity between two tickets. For validation purposes, we need a reference
similarity score between two tickets. Unfortunately, for the current corpus no ground-truth similarity
score for a pair of tickets is readily available. So we assume two tickets $T_1$ and $T_2$ are similar if
both the following conditions are satisfied.
• Both $T_1$ and $T_2$ belong to the same tuple. In other words, $T_1$ and $T_2$ have the same values
for the corresponding chosen fixed fields (e.g., category, subcategory, application name and incident
type).
• The cosine similarity score between $T_1$ and $T_2$ is at least the threshold value of 0.4, that is,
$\cos(T_1, T_2) \ge 0.4$.
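The two conditions above amount to a simple predicate. The sketch below assumes that each ticket carries its fixed-field tuple and a feature vector (for instance its TF*IDF profile) on which the cosine is taken; the paper does not spell out which vectors enter cos(T1, T2), so this choice, the dictionary layout and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_similar(t1, t2, threshold=0.4):
    """Proxy ground truth: same fixed-field tuple AND cosine of at least the threshold."""
    return t1["tuple"] == t2["tuple"] and cosine(t1["vec"], t2["vec"]) >= threshold


# Toy tickets with made-up vectors.
a = {"tuple": ("App1", "Software", "Errors"), "vec": [1.0, 1.0, 0.0]}
b = {"tuple": ("App1", "Software", "Errors"), "vec": [1.0, 0.0, 0.0]}   # cosine about 0.71
c = {"tuple": ("App2", "Software", "Errors"), "vec": [1.0, 1.0, 0.0]}   # different tuple
d = {"tuple": ("App1", "Software", "Errors"), "vec": [0.0, 0.0, 1.0]}   # cosine 0
```

Both conditions must hold: ticket `c` has an identical vector to `a` but a different tuple, and ticket `d` shares the tuple but no terms, so neither counts as similar to `a`.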
To validate our context vector based FFDNN for ticket pair similarity, we conduct the following
experiment. We take a set of pairs of similar tickets and a set of pairs of dissimilar tickets from the
test set using the notion of similarity proposed above. Each such pair $P_i = (T_{i1}, T_{i2})$ is fed to
the context-based FFDNN, from which the similarity score $R_i = \mathrm{Sim}_{DNN}(T_{i1}, T_{i2})$ is
computed using Eqn 9. Let the average similarity score for the set of pairs of similar tickets be
$\Lambda_{sim}$. Similarly, we compute the similarity score $R_i$ for each pair of dissimilar tickets,
and let the average similarity score for the set of pairs of dissimilar tickets be $\Lambda_{dis}$. We
expect that $\Lambda_{sim}$ should be sufficiently larger than $\Lambda_{dis}$.
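The validation quantities are plain averages, as the following sketch shows. The similarity function and the pairs are toy stand-ins (the real scores come from Eqn 9), and the helper name `average_score` is hypothetical.

```python
def average_score(pairs, sim):
    """Mean similarity over a set of ticket pairs (Lambda_sim or Lambda_dis)."""
    scores = [sim(t1, t2) for t1, t2 in pairs]
    return sum(scores) / len(scores)


# Toy stand-in for Sim_DNN: tickets named with the same first letter score high.
toy_sim = lambda t1, t2: 1.0 if t1[0] == t2[0] else 0.2

similar_pairs = [("a1", "a2"), ("b1", "b2")]
dissimilar_pairs = [("a1", "b1"), ("b2", "a2")]
lambda_sim = average_score(similar_pairs, toy_sim)
lambda_dis = average_score(dissimilar_pairs, toy_sim)
gap = lambda_sim - lambda_dis   # the larger the gap, the better the separation
```

The gap between the two averages is the quantity reported in the last column of the ticket pair similarity tables.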
Ticket Ranking. Using this framework, for a given ticket we can find the tickets which are most similar
to it. To that end, for a given ticket $T$ in tuple $\tau$ we compute the DNN-based similarity score
$\mathrm{Sim}_{DNN}(T, T_i)$ for each ticket $T_i$ in the tuple $\tau$. Based on these values we can find
the top-k similar tickets ($k = 1, 3, 5, 10$) for a given ticket $T$.
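Ranking within a tuple reduces to a top-k selection over similarity scores. The sketch below uses Python's `heapq.nlargest` for this; the ticket dictionaries, the toy similarity function and the resolution strings are invented for the example, and in practice `sim` would be the score of Eqn 9.

```python
import heapq


def top_k_similar(query, tickets, sim, k=5):
    """Return the k tickets from the query's tuple with the highest similarity to it."""
    candidates = [t for t in tickets
                  if t["tuple"] == query["tuple"] and t["id"] != query["id"]]
    return heapq.nlargest(k, candidates, key=lambda t: sim(query, t))


# Toy similarity on a made-up scalar "score" field (stand-in for Sim_DNN).
def toy_sim(a, b):
    return -abs(a["score"] - b["score"])


tickets = [
    {"id": 1, "tuple": "A", "score": 0.85, "resolution": "restart service"},
    {"id": 2, "tuple": "A", "score": 0.50, "resolution": "clear cache"},
    {"id": 3, "tuple": "B", "score": 0.80, "resolution": "reset password"},
    {"id": 4, "tuple": "A", "score": 0.60, "resolution": "reinstall driver"},
]
query = {"id": 0, "tuple": "A", "score": 0.80}
top = top_k_similar(query, tickets, toy_sim, k=2)
suggested = [t["resolution"] for t in top]   # resolutions of the top-k tickets
```

Ticket 3 is the closest by score but lies in a different tuple, so it is never considered; the resolutions of the returned tickets are what would be published as suggestions.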
Resolution recommendation for incoming tickets. For a given ticket $T$, once we find the top-k similar
tickets we pick out the resolution corresponding to each of these k tickets. Then we publish these
resolutions as the suggested resolutions for the ticket $T$. We validate the model on resolution
recommendation using the test set as follows. We compare the actual resolution of each test ticket with
the recommended resolution(s) using a semantic similarity [18] score ranging from 0 to 1. In this
approach (the SS approach), the similarity of two short sentences is computed based on descriptive
features of these sentences. Then we can also compute the average semantic similarity score over all
recommended cases. As resolutions are short texts, the SS approach is probably not able to reflect the
similarity content of a pair of resolutions properly.
6.4 Performance Analysis
The FFDNN approach allows models of various sizes with many hidden layers, where each layer in turn
contains a different number of nodes. However, in this paper we have considered models with two hidden
layers. That means the total number of layers excluding the input layer is $N = 2 + 1 = 3$. We may
consider a very deep neural network with many hidden layers, but it may fail to perform better due to
poor propagation of activations and gradients [28].
Table 2: Size of DNNR and DNNRC
Domain   DNNR size            DNNRC size
SnA      2044-1000-500-250    400-200-100-50
FnB      1584-800-400-200     320-160-80-40
Ret      1052-500-250-125     200-100-50-25
Table 3: Ticket pair similarity for SnA Data Set
Approach        Λsim    Λdis    (Λsim − Λdis)
FFDNN           0.850   0.544   0.306
Context FFDNN   1.116   0.723   0.393
Table 4: Ticket pair similarity for FnB Data Set
Approach        Λsim    Λdis    (Λsim − Λdis)
FFDNN           0.669   0.001   0.668
Context FFDNN   1.031   0.409   0.622
We train the FFDNN for 100 epochs (iterations). We also train the model with different values of the
learning rate parameter ϵ (Eqn 10). Out of these trained models, we consider the model for which the
learning curve (loss function vs epoch) has been steadily decreasing.
In this experiment, we have taken λ = 0.3 (Eqn. 4). So the ticket pair similarity can reach at most
1 + λ = 1.3 for the context-based FFDNN, compared with at most 1 for the plain FFDNN.
The sizes of DNNR and DNNRC for different data sets are given in Table 2. The notation
2044 − 1000 − 500 − 250 means the size of the input feature vector is 2044 and the first and second
hidden layers and the output layer contain 1000, 500 and 250 units respectively. In a given layer, we
take roughly half the number of units of the previous layer.
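The halving rule can be checked against the sizes quoted above with a few lines of code. The helper `layer_sizes` is hypothetical and applies exact integer halving, whereas the paper only says "roughly" half.

```python
def layer_sizes(input_dim, n_layers=3):
    """Input size followed by n_layers layers, each half the size of the previous one."""
    sizes = [input_dim]
    for _ in range(n_layers):
        sizes.append(sizes[-1] // 2)   # exact integer halving (the paper rounds more freely)
    return sizes


context_sizes = [layer_sizes(d) for d in (400, 320, 200)]   # context vector dimensions
feature_sizes = layer_sizes(2044)                           # SnA input feature dimension
```

Exact halving reproduces the DNNRC sizes (for example 400-200-100-50), while for DNNR it gives 2044-1022-511-255 against the rounded 2044-1000-500-250 used in the experiments.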
Contextual ticket similarity: As explained in Section 6.3, we compute Λsim for a set of pairs of similar tickets and Λdis for a set of
pairs of dissimilar tickets. For our experiment, we took
100 pairs of similar tickets and 100 pairs of dissimilar tickets
for comparison. The two average similarity scores, Λsim and Λdis,
are given in Tables 3, 4, and 5 for the three data sets. The context-based FFDNN performed better than the plain FFDNN on the SnA and Ret data:
the gap Λsim − Λdis grows from 0.306 to 0.393 and from 0.559 to 0.703 respectively.
On FnB the gap shrinks from 0.668 to 0.622. Because the FnB domain contains
many more tickets than SnA and Ret, the dissimilar tickets we pick
are probably much more dissimilar to the given
ticket than in the other two domains. Also, because
of the larger number of tickets, an adequate number of topics may not
have been captured properly.
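The averaging behind Λsim and Λdis can be sketched as follows. This is a minimal illustration with made-up scores; the function names are hypothetical, and in the experiment each list would hold 100 pair scores.

```python
def average(scores):
    """Arithmetic mean of a non-empty list of similarity scores."""
    return sum(scores) / len(scores)

def similarity_gap(similar_scores, dissimilar_scores):
    """Return (lam_sim, lam_dis, lam_sim - lam_dis) for two sets of pair scores."""
    lam_sim = average(similar_scores)
    lam_dis = average(dissimilar_scores)
    return lam_sim, lam_dis, lam_sim - lam_dis

# Made-up scores standing in for the similar and dissimilar ticket pairs.
similar = [0.90, 0.80, 0.85]
dissimilar = [0.50, 0.60, 0.55]
lam_sim, lam_dis, gap = similarity_gap(similar, dissimilar)
```

A larger gap means the model separates similar pairs from dissimilar pairs more clearly, which is the quantity compared across approaches in the tables.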
Table 5: Ticket pair similarity for Ret Data Set
Approach         Λsim     Λdis     (Λsim − Λdis)
FFDNN            0.904    0.345    0.559
Context FFDNN    1.178    0.475    0.703

Ticket Ranking: We compare the actual summary of each test
ticket with the top k summaries using the semantic similarity (SS)
approach, whose score ranges from 0 to 1. We then compute the
average semantic similarity score over all test tickets; see Tables 6, 7,
and 8. The context-based FFDNN performs better than the simple FFDNN
approach by approximately 5% on the SnA data set and by roughly 5 to 6%
on the Ret data set. However, on the FnB data set the
performance drops slightly at top@1 and top@3 and is essentially unchanged
at top@5 and top@10. This might be for
the same reason as in the contextual similarity task.

Table 6: Ticket Summary Ranking for SnA Data Set
Approach            top@1    top@3    top@5    top@10
FFDNN               0.497    0.531    0.538    0.547
Context FFDNN       0.524    0.559    0.566    0.573
% of improvement    5.4      5.3      5.2      4.8

Table 7: Ticket Summary Ranking for FnB Data Set
Approach            top@1    top@3    top@5    top@10
FFDNN               0.784    0.811    0.814    0.815
Context FFDNN       0.780    0.796    0.819    0.822
% of improvement    −0.5     −1.8     0.6      0.9

Table 8: Ticket Summary Ranking for Ret Data Set
Approach            top@1    top@3    top@5    top@10
FFDNN               0.657    0.718    0.736    0.747
Context FFDNN       0.689    0.762    0.778    0.789
% of improvement    4.9      6.1      5.7      5.6

Resolution recommendation for incoming tickets: As described in Section 6.3, we recommend resolutions for a new ticket. The results are
given in Tables 9, 10, and 11. They show that the context-based FFDNN
performs marginally better than FFDNN overall, with the clearest gains at top@10.

Table 9: Resolution Recommendation for SnA Data Set
Approach         top@1    top@3    top@5    top@10
FFDNN            0.366    0.459    0.485    0.511
Context FFDNN    0.361    0.458    0.484    0.517

Table 10: Resolution Recommendation for FnB Data Set
Approach         top@1    top@3    top@5    top@10
FFDNN            0.399    0.497    0.529    0.559
Context FFDNN    0.392    0.498    0.532    0.562

Table 11: Resolution Recommendation for Ret Data Set
Approach         top@1    top@3    top@5    top@10
FFDNN            0.513    0.604    0.633    0.662
Context FFDNN    0.513    0.606    0.645    0.672

To determine the effectiveness of the semantic similarity (SS) approach, we manually evaluated the resolutions recommended by the
FFDNN approach for two data sets. We inspect the actual resolution of each ticket and the corresponding recommended resolutions
for the top 10 case. We use three similarity scores: 0, 0.5 and 1. If the
meaning of the actual resolution and that of the recommended resolution
appear to be the same (using meta language oriented informal
semantics), we assign a similarity score of 1 to the pair. If
the meanings of the two elements of the pair are not exactly
the same but there is some match, we assign a score of 0.5 to
the pair. Otherwise (when the resolutions differ completely
in meaning), we score the pair 0. As before, we calculate
the average manual similarity score over all test tickets. Table 12
shows that the manual similarity scores are slightly higher than
the automated evaluation by the SS approach.

Table 12: Evaluation of recommended resolutions with FFDNN for top@10
Domain    Manual Evaluation    SS Evaluation
SnA       0.564                0.511
Ret       0.710                0.662

7
CONCLUSIONS
In this work we have integrated context with a deep neural network
to compute the similarity between tickets used in ITIL services. Our
learning algorithm appears to improve on similarity computation that uses a deep network alone, without context. We
also observed some improvement on the other
similarity tasks. In future, we would like to examine other context
models, such as PoS tags, word dependency information, word sense and
domain ontology, and integrate them with a similar learning framework
for performing the semantic similarity tasks discussed in the paper.
This will help automate ticketing systems at different
stages, in addition to automating several incident management,
monitoring and event management tasks.
A
GRADIENT DESCENT
We now formulate the gradient descent algorithm in our framework.
This formulation is based on the one given in [22]. However, we
modify the derivation by taking into account context vectors.
The DNN is trained using gradient-based numerical optimization algorithms [13] because $L(\Omega)$ is differentiable w.r.t. $\Omega$. $\Omega$ consists of the weight matrices $W_k$ and $V_k$ and the bias vectors $b_k$ and $d_k$,
$k = 2, 3, \ldots, N$. The parameters in $\Omega$ are updated as
\[ \Omega_t = \Omega_{t-1} - \epsilon_t \left. \frac{\partial L(\Omega)}{\partial \Omega} \right|_{\Omega = \Omega_{t-1}}, \qquad (10) \]
where $\epsilon_t$ is the learning rate at the $t$-th iteration, and $\Omega_t$ and $\Omega_{t-1}$ are
the model parameters at the $t$-th and $(t-1)$-th iterations, respectively.
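The update rule of Eq. (10) can be sketched in a few lines. The example below is only an illustration on a toy quadratic loss, not the training code used in the experiments; the function names, the constant learning rate and the toy loss are assumptions.

```python
def gradient_descent(grad, omega, learning_rate, iterations):
    """Apply omega_t = omega_{t-1} - eps_t * dL/dOmega at omega_{t-1}."""
    for t in range(1, iterations + 1):
        eps_t = learning_rate(t)   # learning rate at iteration t
        g = grad(omega)            # gradient evaluated at omega_{t-1}
        omega = [w - eps_t * g_w for w, g_w in zip(omega, g)]
    return omega

# Toy loss L(omega) = sum_w (w - 3)^2, whose gradient is 2 * (w - 3).
def toy_grad(omega):
    return [2.0 * (w - 3.0) for w in omega]

# 100 iterations with a constant (hypothetical) learning rate of 0.1.
result = gradient_descent(toy_grad, [0.0, 10.0], lambda t: 0.1, 100)
```

Passing the learning rate as a function of t mirrors the iteration-dependent $\epsilon_t$ in the update rule; a constant is used here only for simplicity.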
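A part of this derivation was presented in [22]. In this work, we
consider individual tickets instead of pairs of tickets to compute
similarity. Moreover, we consider context vectors along with
the ticket vectors.
Let $M$ be the number of ticket summaries $T_m$. For each of
the $M$ tickets, we consider the combination of a similar (positive)
ticket summary $T_i^{m+}$ and three dissimilar (negative) ticket summaries $T_j^{m-}$, $1 \le j \le 3$, for training the DNN. We denote the
$m$-th combination $(T_i^{m+}, T_j^{m-})$ as $T_m$.
Then we can write
\[ L(\Omega) = L_1(\Omega) + L_2(\Omega) + \cdots + L_m(\Omega) + \cdots + L_M(\Omega), \qquad (11) \]
where
\[ L_m(\Omega) = -\log P(T_i^{m+} \mid T_m), \quad 1 \le m \le M, \qquad (12) \]
and
\[ \frac{\partial L(\Omega)}{\partial \Omega} = \sum_{m=1}^{M} \frac{\partial L_m(\Omega)}{\partial \Omega}. \qquad (13) \]
On simplifying,
\[ L_m(\Omega) = \log\Bigl(1 + \sum_{T_j^{m-}} \exp(-\gamma \Delta_j^m)\Bigr), \qquad (14) \]
where $\Delta_j^m = R(T_m, T_i^{m+}) - R(T_m, T_j^{m-})$.
The gradient of the loss function w.r.t. the $N$th weight matrix
$W_N$ is
\[ \frac{\partial L_m(\Omega)}{\partial W_N} = \sum_{T_j^{m-}} \alpha_j^m \frac{\partial \Delta_j^m}{\partial W_N}, \qquad (15) \]
where
\[ \frac{\partial \Delta_j^m}{\partial W_N} = \frac{\partial R(T_m, T_i^{m+})}{\partial W_N} - \frac{\partial R(T_m, T_j^{m-})}{\partial W_N} \qquad (16) \]
and
\[ \alpha_j^m = \frac{-\gamma \exp(-\gamma \Delta_j^m)}{1 + \sum_{T_{j'}^{m-}} \exp(-\gamma \Delta_{j'}^m)}. \qquad (17) \]
Let $y_m$ and $y_i$ be the outputs of DNNR for the ticket
summaries $T_m$ and $T_i$. Let $y_m^c$ and $y_i^c$ be the outputs of DNNRC for the context
vectors $c_m$ and $c_i$ of the ticket summaries $T_m$ and $T_i$ respectively. Then
\[ \frac{\partial R(T_m, T_i)}{\partial W_N} = \frac{\partial}{\partial W_N} \left[ \frac{y_m^T y_i}{\|y_m\| \, \|y_i\|} + \lambda \frac{(y_m^c)^T y_i^c}{\|y_m^c\| \, \|y_i^c\|} \right] = \delta_{y_m}^{(T_m,T_i)} h_{N-1,T_m}^T + \delta_{y_i}^{(T_m,T_i)} h_{N-1,T_i}^T, \qquad (18) \]
where
\[ \delta_{y_m}^{(T_m,T_i)} = (1 - y_m) \circ (1 + y_m) \circ (u v \, y_i - a v u^3 \, y_m), \]
\[ \delta_{y_i}^{(T_m,T_i)} = (1 - y_i) \circ (1 + y_i) \circ (u v \, y_m - a u v^3 \, y_i), \]
\[ a = y_m^T y_i, \quad u = \frac{1}{\|y_m\|}, \quad v = \frac{1}{\|y_i\|}. \]
The operator '$\circ$' denotes elementwise multiplication.
The gradient of the loss function w.r.t. the $N$th weight matrix
$V_N$ of DNNRC is
\[ \frac{\partial L_m(\Omega)}{\partial V_N} = \sum_{T_j^{m-}} \alpha_j^m \frac{\partial \Delta_j^m}{\partial V_N}, \qquad (19) \]
where
\[ \frac{\partial \Delta_j^m}{\partial V_N} = \frac{\partial R(T_m, T_i^{m+})}{\partial V_N} - \frac{\partial R(T_m, T_j^{m-})}{\partial V_N} \qquad (20) \]
and
\[ \frac{\partial R(T_m, T_i)}{\partial V_N} = \frac{\partial}{\partial V_N} \left[ \frac{y_m^T y_i}{\|y_m\| \, \|y_i\|} + \lambda \frac{(y_m^c)^T y_i^c}{\|y_m^c\| \, \|y_i^c\|} \right] = \lambda \left( \delta_{y_m^c}^{c,(c_m,c_i)} l_{N-1,c_m}^T + \delta_{y_i^c}^{c,(c_m,c_i)} l_{N-1,c_i}^T \right), \qquad (21) \]
where
\[ \delta_{y_m^c}^{c,(c_m,c_i)} = (1 - y_m^c) \circ (1 + y_m^c) \circ (u^c v^c \, y_i^c - a^c v^c (u^c)^3 \, y_m^c), \]
\[ \delta_{y_i^c}^{c,(c_m,c_i)} = (1 - y_i^c) \circ (1 + y_i^c) \circ (u^c v^c \, y_m^c - a^c u^c (v^c)^3 \, y_i^c), \]
\[ a^c = (y_m^c)^T y_i^c, \quad u^c = \frac{1}{\|y_m^c\|}, \quad v^c = \frac{1}{\|y_i^c\|}. \]
For hidden layers, we also need to calculate $\delta$ for each $\Delta_j^m$. We
calculate each $\delta$ in the hidden layer $k$ through back propagation as
\[ \delta_{k,T_m}^{(T_m,T_i)} = (1 + h_{k,T_m}) \circ (1 - h_{k,T_m}) \circ W_{k+1}^T \delta_{k+1,T_m}^{(T_m,T_i)}, \]
\[ \delta_{k,T_i}^{(T_m,T_i)} = (1 + h_{k,T_i}) \circ (1 - h_{k,T_i}) \circ W_{k+1}^T \delta_{k+1,T_i}^{(T_m,T_i)}, \]
\[ \delta_{k,c_m}^{c,(c_m,c_i)} = (1 + l_{k,c_m}) \circ (1 - l_{k,c_m}) \circ V_{k+1}^T \delta_{k+1,c_m}^{c,(c_m,c_i)}, \]
\[ \delta_{k,c_i}^{c,(c_m,c_i)} = (1 + l_{k,c_i}) \circ (1 - l_{k,c_i}) \circ V_{k+1}^T \delta_{k+1,c_i}^{c,(c_m,c_i)}, \qquad (22) \]
with
\[ \delta_{N,T_m}^{(T_m,T_i)} = \delta_{y_m}^{(T_m,T_i)}, \quad \delta_{N,c_m}^{c,(c_m,c_i)} = \delta_{y_m^c}^{c,(c_m,c_i)}, \]
\[ \delta_{N,T_i}^{(T_m,T_i)} = \delta_{y_i}^{(T_m,T_i)}, \quad \delta_{N,c_i}^{c,(c_m,c_i)} = \delta_{y_i^c}^{c,(c_m,c_i)}. \]
The gradient of the loss function w.r.t. the intermediate weight
matrix $W_k$, $k = 2, 3, \ldots, N-1$, can be computed as
\[ \frac{\partial L_m(\Omega)}{\partial W_k} = \sum_{T_j^{m-}} \alpha_j^m \frac{\partial \Delta_j^m}{\partial W_k}, \qquad (23) \]
where
\[ \frac{\partial \Delta_j^m}{\partial W_k} = \frac{\partial R(T_m, T_i^{m+})}{\partial W_k} - \frac{\partial R(T_m, T_j^{m-})}{\partial W_k} = \delta_{k,T_m}^{(T_m,T_i)} h_{k-1,T_m}^T + \delta_{k,T_i}^{(T_m,T_i)} h_{k-1,T_i}^T - \delta_{k,T_m}^{(T_m,T_j)} h_{k-1,T_m}^T - \delta_{k,T_j}^{(T_m,T_j)} h_{k-1,T_j}^T. \qquad (24) \]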
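The relation between the loss of Eq. (14) and the weights $\alpha_j^m$ of Eq. (17) can be checked numerically: each $\alpha_j^m$ is the derivative of $L_m$ with respect to $\Delta_j^m$. The sketch below is a minimal illustration; the function names, the value of γ and the toy values of Δ are assumptions made for this example.

```python
import math

def pair_loss(deltas, gamma):
    """L_m = log(1 + sum_j exp(-gamma * Delta_j)), as in Eq. (14)."""
    return math.log(1.0 + sum(math.exp(-gamma * d) for d in deltas))

def alphas(deltas, gamma):
    """alpha_j = -gamma * exp(-gamma * Delta_j) / (1 + sum_j' exp(-gamma * Delta_j'))."""
    denom = 1.0 + sum(math.exp(-gamma * d) for d in deltas)
    return [-gamma * math.exp(-gamma * d) / denom for d in deltas]

def numeric_alphas(deltas, gamma, step=1e-6):
    """Central finite differences of the loss with respect to each Delta_j."""
    out = []
    for j in range(len(deltas)):
        up = list(deltas)
        up[j] += step
        down = list(deltas)
        down[j] -= step
        out.append((pair_loss(up, gamma) - pair_loss(down, gamma)) / (2 * step))
    return out

# Made-up similarity margins for the three negative tickets of one combination.
deltas = [0.4, -0.1, 0.7]
gamma = 2.0  # hypothetical smoothing factor
analytic = alphas(deltas, gamma)
numeric = numeric_alphas(deltas, gamma)
```

Every $\alpha_j^m$ is negative, so a gradient descent step increases each margin $\Delta_j^m$, and negatives with a small or negative margin receive the largest weight.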
The gradient of the loss function w.r.t. the intermediate weight
matrix $V_k$, $k = 2, 3, \ldots, N-1$, can be computed as
\[ \frac{\partial L_m(\Omega)}{\partial V_k} = \sum_{T_j^{m-}} \alpha_j^m \frac{\partial \Delta_j^m}{\partial V_k}, \qquad (25) \]
where
\[ \frac{\partial \Delta_j^m}{\partial V_k} = \frac{\partial R(T_m, T_i^{m+})}{\partial V_k} - \frac{\partial R(T_m, T_j^{m-})}{\partial V_k} = \lambda \left( \delta_{k,c_m}^{c,(c_m,c_i)} l_{k-1,c_m}^T + \delta_{k,c_i}^{c,(c_m,c_i)} l_{k-1,c_i}^T - \delta_{k,c_m}^{c,(c_m,c_j)} l_{k-1,c_m}^T - \delta_{k,c_j}^{c,(c_m,c_j)} l_{k-1,c_j}^T \right). \qquad (26) \]
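The error terms δ used in these gradients share one building block: the derivative of a cosine similarity taken through the output units. The sketch below checks the output-layer formula that follows Eq. (18) against finite differences. It assumes tanh output units, which is consistent with the factor $(1 - y) \circ (1 + y)$, and the same code applies to the context outputs $y_m^c$, $y_i^c$ of Eq. (21). All names and toy values are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

def delta_output(y_m, y_i):
    """(1 - y_m) o (1 + y_m) o (u*v*y_i - a*v*u^3*y_m)."""
    a = sum(p * q for p, q in zip(y_m, y_i))
    u = 1.0 / math.sqrt(sum(p * p for p in y_m))
    v = 1.0 / math.sqrt(sum(q * q for q in y_i))
    return [(1.0 - p) * (1.0 + p) * (u * v * q - a * v * u ** 3 * p)
            for p, q in zip(y_m, y_i)]

def numeric_delta(z_m, y_i, step=1e-6):
    """Finite-difference derivative of cos(tanh(z_m), y_i) w.r.t. each z_m[k]."""
    out = []
    for k in range(len(z_m)):
        up = list(z_m)
        up[k] += step
        down = list(z_m)
        down[k] -= step
        f_up = cosine([math.tanh(z) for z in up], y_i)
        f_down = cosine([math.tanh(z) for z in down], y_i)
        out.append((f_up - f_down) / (2 * step))
    return out

z_m = [0.3, -0.8, 1.2]              # made-up pre-activations of the output layer
y_m = [math.tanh(z) for z in z_m]   # output for ticket T_m (tanh assumed)
y_i = [0.5, -0.2, 0.9]              # made-up output for ticket T_i
analytic = delta_output(y_m, y_i)
numeric = numeric_delta(z_m, y_i)
```

Agreement between the two lists indicates that the closed-form expression is the derivative of the cosine similarity with respect to the pre-activations of the output layer.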
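Similarly, the gradient of the loss function w.r.t. the biases can be derived.
The partial derivatives of $R(T_m, T_i^{m+})$ w.r.t. the biases $b_N$ and $b_k$, $k = 2, 3, \ldots, N-1$, are
\[ \frac{\partial R(T_m, T_i^{m+})}{\partial b_N} = \delta_{y_m}^{(T_m,T_i)} + \delta_{y_i}^{(T_m,T_i)}, \qquad (27) \]
\[ \frac{\partial R(T_m, T_i^{m+})}{\partial b_k} = \delta_{k,T_m}^{(T_m,T_i)} + \delta_{k,T_i}^{(T_m,T_i)}. \qquad (28) \]
The partial derivatives of $R(T_m, T_i^{m+})$ w.r.t. the biases $d_N$ and $d_k$, $k = 2, 3, \ldots, N-1$, are
\[ \frac{\partial R(T_m, T_i^{m+})}{\partial d_N} = \lambda \left( \delta_{y_m^c}^{c,(c_m,c_i)} + \delta_{y_i^c}^{c,(c_m,c_i)} \right), \qquad (29) \]
\[ \frac{\partial R(T_m, T_i^{m+})}{\partial d_k} = \lambda \left( \delta_{k,c_m}^{c,(c_m,c_i)} + \delta_{k,c_i}^{c,(c_m,c_i)} \right), \qquad (30) \]
where the factor $\lambda$ comes from the context term of $R$, as in Eq. (21).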
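The hidden-layer recursion of Eq. (22) and the bias gradients of Eqs. (27) and (28) reduce to a few lines of code. The sketch below is a minimal pure-Python illustration; the helper names, the row-major weight layout and the toy numbers are assumptions made for this example.

```python
def backprop_delta(h_k, w_next, delta_next):
    """delta_k = (1 + h_k) o (1 - h_k) o (W_{k+1}^T delta_{k+1}).

    w_next is assumed to be stored as a list of rows, with one row per unit
    of layer k+1 and one column per unit of layer k.
    """
    wt_delta = [
        sum(w_next[r][c] * delta_next[r] for r in range(len(delta_next)))
        for c in range(len(h_k))
    ]
    return [(1.0 + h) * (1.0 - h) * s for h, s in zip(h_k, wt_delta)]

def bias_gradient(delta_a, delta_b):
    """Gradient of R w.r.t. a bias vector: the sum of the two error terms."""
    return [a + b for a, b in zip(delta_a, delta_b)]

# Toy (made-up) values for one hidden layer with two units.
h_k = [0.0, 0.5]
w_next = [[1.0, 2.0], [3.0, 4.0]]
delta_next = [0.1, 0.2]
delta_k = backprop_delta(h_k, w_next, delta_next)   # approximately [0.7, 0.75]
grad_b = bias_gradient(delta_k, [0.1, 0.05])        # approximately [0.8, 0.8]
```

Because a bias enters each unit additively, its gradient needs no multiplication by the previous layer's activations, unlike the weight gradients.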
REFERENCES
[1] ACL’15 2015. Proceedings of the 53rd Annual Meeting of the Association for
Computational Linguistics and the 7th International Joint Conference on Natural
Language Processing of the Asian Federation of Natural Language Processing, ACL
2015, July 26-31, 2015, Beijing, China, Volume 1 & 2: Long and Short Papers. The
Association for Computer Linguistics. http://aclweb.org/anthology/P/P15/
[2] Hadi Amiri, Philip Resnik, Jordan Boyd-Graber, and Hal Daumé III. 2016. Learning text pair similarity with context-sensitive autoencoders. In Proceedings of the
54th Annual Meeting of the Association for Computational Linguistics (ACL’16),
Vol. 1. 1882–1892.
[3] Chao An, Jiuming Huang, Shoufeng Chang, and Zhijie Huang. 2016. Question
Similarity Modeling with Bidirectional Long Short-term Memory Neural Network. In Proceedings of IEEE First International Conference on Data Science in
Cyberspace.
[4] Yoshua Bengio. 2009. Learning Deep Architectures for AI. Foundations and
Trends in Machine Learning 2, 1 (2009), 1–127.
[5] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet
Allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
[6] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray
Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural Language Processing (Almost)
from Scratch. Journal of Machine Learning Research 12 (2011), 2493–2537.
[7] Li Deng, Xiaodong He, and Jianfeng Gao. 2013. Deep stacking networks for
information retrieval. In IEEE International Conference on Acoustics, Speech and
Signal Processing, ICASSP’13, Vancouver, BC, Canada. 3153–3157.
[8] Cícero Nogueira dos Santos, Luciano Barbosa, Dasha Bogdanova, and Bianca
Zadrozny. 2015. Learning Hybrid Representations to Retrieve Semantically
Equivalent Questions, See [1], 694–699. http://aclweb.org/anthology/P/P15/
[9] Jianfeng Gao, Xiaodong He, and Jian-Yun Nie. 2010. Clickthrough-based translation models for web search: from word models to phrase models. In Proceedings
of the 19th ACM Conference on Information and Knowledge Management, CIKM’10.
1139–1148.
[10] Jianfeng Gao, Kristina Toutanova, and Wen-tau Yih. 2011. Clickthrough-based
latent semantic models for web search. In Proceedings of the 34th International
ACM SIGIR Conference on Research and Development in Information Retrieval,
SIGIR’11. 675–684.
[11] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional
neural network architectures for matching natural language sentences. In Advances in neural information processing systems. 2042–2050.
[12] Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng.
2012. Improving Word Representations via Global Context and Multiple Word
Prototypes. In The 50th Annual Meeting of the Association for Computational
Linguistics, Proceedings of the Conference, ACL’12: Long Papers. 873–882.
[13] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry P.
Heck. 2013. Learning deep structured semantic models for web search using
clickthrough data. In 22nd ACM International Conference on Information and
Knowledge Management, CIKM’13. 2333–2338.
[14] Da Kuang, Jaegul Choo, and Haesun Park. 2015. Nonnegative matrix factorization for interactive topic modeling and document clustering. In Partitional
Clustering Algorithms. Springer, 215–243.
[15] Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences
and Documents. In Proceedings of the 31st International Conference on Machine
Learning, ICML’14. 1188–1196.
[16] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep Learning. Nature
521 (2015), 436–444.
[17] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive
Datasets (2nd ed.). Cambridge University Press.
[18] Yuhua Li, David McLean, Zuhair Bandar, James O’Shea, and Keeley A. Crockett.
2006. Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE
Trans. Knowl. Data Eng. 18, 8 (2006), 1138–1150.
[19] Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. 2015. Hierarchical Recurrent Neural Network for Document Modeling. In Proceedings of the
2015 Conference on Empirical Methods in Natural Language Processing, EMNLP’15.
899–907.
[20] Zhengdong Lu and Hang Li. 2013. A deep architecture for matching short texts.
In Advances in Neural Information Processing Systems. 1367–1375.
[21] Jonas Mueller and Aditya Thyagarajan. 2016. Siamese Recurrent Architectures
for Learning Sentence Similarity. In AAAI. 2786–2792.
[22] D. P. Muni, S. Roy, Y. T. Y. J. John L. Chiang, A. JM Viallet, and N. Budhiraja.
2017. Recommending resolutions of ITIL services tickets using Deep Neural
Network. In Proceedings of the 4th IKDD Conference on Data Science, CODS.
[23] Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes, See [1], 1793–1803. http://aclweb.org/anthology/P/P15/
[24] S. Roy, D. P. Muni, JJ. Yeung T. Y., N. Budhiraja, and F. Ceiler. 2016. Clustering
and Labeling IT Maintenance Tickets. In Service-Oriented Computing - 14th
International Conference, ICSOC 2016, Banff, AB, Canada, Proceedings. 829–845.
[25] G. Salton and C. Buckley. 1988. Term-Weighting Approaches in Automatic Text
Retrieval. Information Processing and Management (1988).
[26] Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to Rank Short Text
Pairs with Convolutional Deep Neural Networks. In Proceedings of the 38th
International ACM SIGIR Conference on Research and Development in Information
Retrieval’15. 373–382.
[27] Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to rank short text
pairs with convolutional deep neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information
Retrieval. ACM, 373–382.
[28] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very
deep networks. In Advances in neural information processing systems. 2377–2385.
[29] Keith Stevens, W. Philip Kegelmeyer, David Andrzejewski, and David Buttler.
2012. Exploring Topic Coherence over Many Models and Many Topics. In
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language
Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012,
July 12-14, 2012, Jeju Island, Korea. 952–961.