# First Graph-TA Program

The schedule is organized in such a way that the talks in each session are varied and usually do not have a strong relationship. The reason is that we will give time for discussion after each session, and the varied number of talks will allow for richer interaction.

**Link to venue of 1st Graph-TA 2013**

## 9:45 - 10:00 Registration

## 10:00 - 11:00 Presentation session I

### Horst Bunke. "Towards a Quadratic Time Approximation of Graph Edit Distance"

**Abstract:**Graph edit distance is a well established concept to measure the dissimilarity of graphs. It can be used for various purposes, for example, nearest-neighbor classification, clustering, and various kernel methods based on dissimilarity. However in its original form, its complexity is exponential. Therefore, various approximate procedures have been proposed for its computation. In this presentation we describe work towards a new approximate procedure for graph edit distance computation, based on Hausdorff distance, that runs in quadratic time. In an experimental evaluation, graph edit distance based features were used in a Hidden Markov Model recognizer for medieval handwritten text recognition. It turned out that the new graph edit distance runs more than ten times faster than another well established approximate procedure. Furthermore, there is no statistically significant difference of the recognition rate between the two procedures.

### Leen Lambers. "Automated Reasoning for Graph Properties and its Applications to Graph Database Design and Analysis"

**Abstract**: Data structures occurring in today’s most important problem domains - such as social networks, biological networks, transportation, 3D modeling, the world wide web, business process modeling to only name a few - can be expressed as graphs. In the last decade we can therefore observe an increasing success of graphs as a data structure in different computer science areas. In particular, there seems to be an urgent and increasing need to store, query and manipulate graph-like data demonstrated by the recent success of graph databases. This talk adresses the development of dedicated automated reasoning techniques for the design and analysis of graph database systems. In particular, this talk begins with a presentation of GL, a user-friendly visual language for expressing graph properties that is shown to be equivalent to first-order logic on graphs. It continues with an exploration of the current chances and limitations of dedicated automated reasoning techniques for GL. Finally, it will illustrate some application scenario's of these techniques to performing graph query optimizations as well as graph query sanity checks in graph database systems.

### Josep Ginebra. "A Cluster Analysis of Vote Transitions"

**Abstract**: To help settle the debate triggered the day after any election around the origin and destination of the vote of winners and losers, a Bayesian analysis of the results in a pair of consecutive elections is proposed. It is based on a model that simultaneously carries out a cluster analysis of the areas in which the results are broken into and links the results in the two elections of areas in a given cluster through a vote switch matrix. The number of clusters is chosen both through predictive checks as well as by testing whether the residuals are spatially correlated or not. The analysis is tried on the results of two pairs of consecutive elections in Catalonia, with the first election of the pair being for the Catalan parliament and the second election in the pair for the Spanish parliament.

### Marta Pérez. "Marshall-Olkin Extended Zipfian Distribution."

**Abstract**: The Zipfian distribution is the particular case of Power Law distribution that has support the strictly positive integers. In this paper the Zipfian distribution is generalized by means of the Marshall-Olkin transformation. The new model is a two-parameter model with more flexibility to adjust the probabilities of the first positive integer numbers while keeping the tail probabilities nearly invariant. The main properties of the new model are presented, and several data sets are analyzed in order to see the gain obtained by using the generalized model.

### Susana Ladra. "k2-tree for Web graph representation."

**Abstract**: The representation of large subsets of the World Wide Web in the form of a directed graph has been extensively used to analyze structure, behavior, and evolution of those so-called Web graphs. However, interesting Web graphs are very large and their classical representations do not fit into the main memory of typical computers, whereas the required graph algorithms perform inefficiently on secondary memory. Compressed graph representations drastically reduce their space requirements while allowing their efficient navigation in compressed form. We present the k2-tree, a Web graph representation that offers the least space usage of the state of the art while supporting fast navigation to predecessors and successors and sharply outperforming on extended functionality, such as checking for specific links between nodes, or between ranges of nodes, or listing the links between ranges.

### Josep Lluis Larriba-Pey. "Linked Data Benchmark Council."

**Abstract**: In this brief talk we will present the EC funded project “Linked Data Benchmark Council” (LDBC, www.ldbc.eu). The goals of the project are to create the first comprehensive suite of open, fair and vendor-neutral benchmarks for RDF/graph databases together with the LDBC foundation which will de fine processes for obtaining, auditing and publishing results. The project will carry a significant amount of work in the methodology of benchmark creation, the decision and evolution of the workloads which include the use case selection, creation of queries and generation of data.

## 11:00 - 12:20 Poster session I (in conjunction with coffee break)

## 12:20 - 13:30 Presentation session II

### Romualdo Pastor-Satorras. "Modeling face-to face social interaction networks: The role of human dynamics"

**Abstract**: Face-to-face interaction networks describe social interactions in human gatherings, and are the substrate for processes such as epidemic spreading and gossip propagation. The bursty nature of human behavior characterizes many aspects of empirical data, such as the distribution of conver- sation lengths, of conversations per person, or of inter-conversation times. Despite several recent attempts, a general theoretical understanding of the global picture emerging from data is still lack- ing. Here we present a simple model that reproduces quantitatively most of the relevant features of empirical face-to-face interaction networks. The model describes agents which perform a random walk in a two dimensional space and are characterized by an attractiveness whose effect is to slow down the motion of people around them. The proposed framework sheds light on the dynamics of human interactions and can improve the modeling of dynamical processes taking place on the ensuing dynamical social networks.

### Francesc Comellas. "Reconstruction of Complex Networks"

**Abstract**: We study the problem of reconstructing a network topology from a small set of values that represent it. For this study we have considered both the spectrum of the network and its betweenness centrality (a measure of the influence of each of its nodes in the dissemination of information over the network). We use two simple metaheuristics, tabu search and simulated annealing, as combinatorial optimization methods to generate the network from the values of the spectrum and the betweenness centrality and compare their performance when reconstructing different categories of networks –random, regular, small-world, scale-free and clustered–. We show that it is possible an exact reconstruction of small networks and good topological approximations for networks with larger orders.

### Ernest Valveny. "Vector Space Embedding of Graphs via Statistics of Labelling Information"

**Abstract**: In pattern recognition, graph-based representations are potentially interesting as they can express relations between parts of an object or a pattern in a very natural way. However, the high computational complexity of the algorithms for analyzing and matching graph-based patterns make them unsuitable for many real-life pattern recognition applications, where simpler representations based on numerical feature vectors are used. An elegant solution to combine the advantages of both types of representations is graph embedding, that is to transform the graph domain into a vector domain, mapping each graph to a point in a vector space. In this way, the rich set of algorithms working in the vector domain can automatically be applied to the graph domain. In this work we propose to associate feature vectors to graphs by putting attention on the labelling information of the graphs. In particular, we count frequencies of node labels and of edges between labels that, despite their locality, are able to robustly represent global properties of graphs, dealing with both discrete and continuous attributed graphs. We also show the experimental evaluation of this methodology applied to classification and clustering of patterns.

### Xavier Martínez. "Dynamic Data Partitioning for Distributed Graph Databases"

**Abstract**: We present a dynamic partitioning method for graph databases that combines minimization of network communication with load balancing and cache specialization to obtain high throughput. The system analyzes the incoming queries of the system, storing the data access patterns in compressed matrix form. This information is used to compute dynamically a partitioning that maximizes the query throughput while minimizing network communication. Compared to ParallelGDB, a previous distributed graph system based on cache specialization, throughput is up to an order of magnitude larger in the best case, and the average response time of the system is cut in half.

### Ricard Gavaldà. "Mining Frequent Closed Graphs on Evolving Data Streams"

**Abstract**: Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real-time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this paper we present a framework for studying graph pattern mining on time-varying streams. Three new methods for mining frequent closed subgraphs are presented. All methods work on coresets of closed subgraphs, compressed representations of graph sets, and maintain these sets in a batch-incremental manner, but use different approaches to address potential concept drift. An evaluation study on datasets comprising up to four million graphs explores the strength and limitations of the proposed methods. To the best of our knowledge this is the first work on mining frequent closed subgraphs in non-stationary data streams.

### Renzo Angles. "Benchmarking essential graph queries"

**Abstract**: We have designed and implemented a benchmark for evaluating queries that can be considered essential in graph databases (e.g. adjacency and path queries). The data and queries used in the benchmark are based on a social network model as use-case. Our experiments include the evaluation of graph databases (Dex, InfiniteGraph, Neo4j, OrientDB), RDF databases (Allegro, BigData, Virtuoso, 4Store) and a relational database (PostgreSQL).

### Anjan Dutta. "A symbol spotting approach in graphical documents by hashing serialized graphs"

**Abstract**: In this talk we explore a symbol spotting technique in graphical documents with serialized subgraph matching. Graphs are used to represent the documents and a (sub)graph matching technique is used to detect the symbols in them. We propose a graph serialization to reduce the usual computational complexity of graph matching. Serialization of graphs is performed by computing acyclic graph paths between each pair of connected nodes. Graph paths are one dimensional structures of graphs which are less expensive in terms of computation. At the same time they enable robust localization even in the presence of noise and distortion. Indexing in large graph databases involves a computational burden as well. We propose a graph factorization approach to tackle this problem. Factorization is intended to create a unified indexed structure over the database of graphical documents. Once graph paths are extracted, the entire database of graphical documents is indexed in hash tables by locality sensitive hashing (LSH) of shape descriptors of the paths. The hashing data structure aims to execute an approximate k-NN search in a sub-linear time. We have performed detailed experiments with various datasets of line drawings and compared our method with the state-of-the-art works. The results demonstrate the effectiveness and efficiency of our technique.

## 13:30 - 15:00 Poster session II (in conjunction with lunch)

## 15:00 - 16:00 Presentation session III

### Francesc Serratosa. "Active & Interactive Graph Matching"

**Abstract**: We propose a method to perform active and interactive graph matching in which the active learner queries one of the nodes of the first graph and the oracle feedback is the corresponding node of the other graph. The method uses any graph matching algorithm that iteratively updates a probability matrix between nodes (Graduated Assignment, Expectation Maximisation or Probabilistic Relaxation). The oracle’s feedback is used to update the costs between nodes and arcs of both graphs. We present and validate four different active strategies based on the probability matrix between nodes. It is not needed to modify the code of the graph- matching algorithms, since our method simply needs to read the probability matrix and to update the costs between nodes and arcs. Practical validation shows that with few oracle’s feedbacks, the algorithm finds the labelling that the user considers optimal because imposing few labellings the other ones are corrected automatically.

### Nieves Brisaboa. "Variants of k2-tree for RDF graphs"

**Abstract**: The growing volume of data being generated and published in the Web of Data as RDF datasets makes this data difficult to exchange, index and consume. This scalability problem greatly diminishes the potential of interconnected RDF graphs. Despite being originally designed for Web graphs, k2-tree are also of general interest and can be used to compress other kinds of graphs and data structures, such as RDF graphs. We present some variants of the method that achieve ultra-compressed representations of large RDFgraphs and allow SPARQL queries to be full-in-memory performed without decompression. By using k2-tree for RDF graphs we can outperform the state-of-the-art compressibility and the traditional vertical-partitioning query resolution, remaining very competitive with multi-index solutions.

### Josep Lladós. "Improving Fuzzy Multilevel Graph Embedding through Feature Selection Technique"

**Abstract**: Graphs are the most powerful, expressive and convenient data structures but there is a lack of efficient computational tools and algorithms for processing them. The embedding of graphs into numeric vector spaces permits them to access the state-of-the-art computational effcient statistical models and tools. In this work we take forward our work on explicit graph embedding and present an improvement to our earlier proposed method, named "fuzzy multilevel graph embedding - FMGE" through feature selection technique. FMGE achieves the embedding of attributed graphs into low dimensional vector spaces by performing a multilevel analysis of graphs and extracting a set of global, structural and elementary level features. Feature selection permits FMGE to select the subset of most discriminating features and to discard the confusing ones for underlying graph dataset. Experimental results for graph classication experimentation on IAM letter, GREC and fingerprint graph databases, show improvement in the performance of FMGE.

### David Domínguez. "Shaping communities out of triangles"

**Abstract**: Community detection has arisen as one of the most relevant topics in the field of graph data mining due to its importance in many fields such as biology, social networks or network traffic analysis. The metrics proposed to shape communities are generic and follow two approaches: maximizing the internal density of such communities or reducing the connectivity of the internal vertices with those outside the community. However, these metrics take the edges as a set and do not consider the internal layout of the edges in the community. We define a set of properties oriented to social networks that ensure that communities are cohesive, structured and well defined. Then, we propose the Weighted Community Clustering (WCC), which is a community metric based on triangles. We proof that analyzing communities by triangles gives communities that fulfill the listed set of properties, in contrast to previous metrics. Finally, we experimentally show that WCC correctly captures the concept of community in social networks using real and synthetic datasets, and compare statistically some of the most relevant community detection algorithms in the state of the art.

### Joan Vilatella. "A Simple Heuristic Algorithm for Fast Edge-coloring of Large Graphs"

**Abstract**: Coloring the edges of a graph is a problem of practical interest but hard for very large graphs. We present a relatively simple algorithm, easy to describe and implement, allowing us to edge-color large graphs (up to millions of edges) with a good empirical efficiency, specially if the maximum degree is small. The basic idea of the algorithm is to solve conflicts successively, where a graph node is considered to be a conflicting one if its incident edges are unproperly colored.

*Joint work with Miquel Àngel Fiol.*

### David Francis Nettleton. "Some Issues related to the Mining of OSNs represented as Graphs"

**Abstract**: This brief talk will consider some of the issues which graph data miners may encounter when analyzing Online Social Networks represented as graphs. Such issues include the elicitation of a community structure, finding similar subgraphs and computational cost issues, among others.