Lecture Notes in Computer Science 10987
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
More information about this series at http://www.springer.com/series/7409
Yi Cai • Yoshiharu Ishikawa
Jianliang Xu (Eds.)
Web and Big Data
Second International Joint Conference, APWeb-WAIM 2018, Macau, China, July 23–25, 2018
Proceedings, Part I
Lecture Notes in Computer Science
ISBN 978-3-319-96889-6 ISBN 978-3-319-96890-2 (eBook)
https://doi.org/10.1007/978-3-319-96890-2
Library of Congress Control Number: 2018948814
LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI
© Springer International Publishing AG, part of Springer Nature 2018
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

This volume (LNCS 10987) and its companion volume (LNCS 10988) contain the proceedings of the second Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint Conference on Web and Big Data, called APWeb-WAIM. This joint conference aims to attract participants from different scientific communities as well as from industry, and not merely from the Asia-Pacific region, but also from other continents. The objective is to enable the sharing and exchange of ideas, experiences, and results in the areas of the World Wide Web and big data, thus covering Web technologies, database systems, information management, software engineering, and big data.

The second APWeb-WAIM conference was held in Macau during July 23–25, 2018. As an Asia-Pacific flagship conference focusing on research, development, and applications in relation to Web information management, APWeb-WAIM builds on the successes of APWeb and WAIM: APWeb was previously held in Beijing (1998), Hong Kong (1999), Xi'an (2000), Changsha (2001), Xi'an (2003), Hangzhou (2004), Shanghai (2005), Harbin (2006), Huangshan (2007), Shenyang (2008), Suzhou (2009), Busan (2010), Beijing (2011), Kunming (2012), Sydney (2013), Changsha (2014), Guangzhou (2015), and Suzhou (2016); and WAIM was held in Shanghai (2000), Xi'an (2001), Beijing (2002), Chengdu (2003), Dalian (2004), Hangzhou (2005), Hong Kong (2006), Huangshan (2007), Zhangjiajie (2008), Suzhou (2009), Jiuzhaigou (2010), Wuhan (2011), Harbin (2012), Beidaihe (2013), Macau (2014), Qingdao (2015), and Nanchang (2016). The first joint APWeb-WAIM conference was held in Beijing (2017). With the fast development of Web-related technologies, we expect that APWeb-WAIM will become an increasingly popular forum that brings together outstanding researchers and developers in the field of the Web and big data from around the world.

The high-quality program documented in these proceedings would not have been possible without the authors who chose APWeb-WAIM for disseminating their findings. Out of 168 submissions, the conference accepted 39 regular research papers (23.21%), 31 short research papers, and six demonstrations. The contributed papers address a wide range of topics, such as text analysis, graph data processing, social networks, recommender systems, information retrieval, data streams, knowledge graphs, data mining and applications, query processing, machine learning, database and Web applications, big data, and blockchain. The technical program also included keynotes by Prof. Xuemin Lin (The University of New South Wales, Australia), Prof. Lei Chen (The Hong Kong University of Science and Technology, Hong Kong SAR, China), and Prof. Ninghui Li (Purdue University, USA), as well as industrial invited talks by Dr. Zhao Cao (Huawei Blockchain) and Jun Yan (YiDu Cloud). We are grateful to these distinguished scientists for their invaluable contributions to the conference program.

As a joint conference, teamwork was particularly important for the success of APWeb-WAIM. We are deeply thankful to the Program Committee members and the external reviewers for lending their time and expertise to the conference. Special thanks go to the local Organizing Committee led by Prof. Zhiguo Gong. Thanks also go to the workshop co-chairs (Leong Hou U and Haoran Xie), demo co-chairs (Zhixu Li, Zhifeng Bao, and Lisi Chen), industry co-chair (Wenyin Liu), tutorial co-chair (Jian Yang), panel chair (Kamal Karlapalem), local arrangements chair (Derek Fai Wong), and publicity co-chairs (An Liu, Feifei Li, Wen-Chih Peng, and Ladjel Bellatreche). Their efforts were essential to the success of the conference. Last but not least, we wish to express our gratitude to the treasurer (Andrew Shibo Jiang) and the Webmaster (William Sio) for all the hard work, and to our sponsors who generously supported the smooth running of the conference. We hope you enjoy the exciting program of APWeb-WAIM 2018 as documented in these proceedings.
Jianliang Xu
Yoshiharu Ishikawa
Organization

Zhiguo Gong University of Macau, SAR China
Qing Li City University of Hong Kong, SAR China
Kam-fai Wong Chinese University of Hong Kong, SAR China
Workshop Co-chairs

Leong Hou U University of Macau, SAR China
Haoran Xie Education University of Hong Kong, SAR China
Demo Co-chairs
Lisi Chen Wollongong University, Australia
Publicity Co-chairs

Wen-Chih Peng National Taiwan University, China
Ladjel Bellatreche ISAE-ENSMA, Poitiers, France
Treasurers
Leong Hou U University of Macau, SAR China
Andrew Shibo Jiang Macau Convention and Exhibition Association, SAR China
Local Arrangements Chair
Derek Fai Wong University of Macau, SAR China
Webmaster
William Sio University of Macau, SAR China
Senior Program Committee
Byron Choi Hong Kong Baptist University, SAR China
Christian Jensen Aalborg University, Denmark
Demetrios Zeinalipour-Yazti University of Cyprus, Cyprus
Guoliang Li Tsinghua University, China
K. Selçuk Candan Arizona State University, USA
Kyuseok Shim Seoul National University, South Korea
Makoto Onizuka Osaka University, Japan
Reynold Cheng The University of Hong Kong, SAR China
Toshiyuki Amagasa University of Tsukuba, Japan
Wang-Chien Lee Pennsylvania State University, USA
Wen-Chih Peng National Chiao Tung University, Taiwan
Wook-Shin Han Pohang University of Science and Technology, South Korea
Xiaokui Xiao National University of Singapore, Singapore
Ying Zhang University of Technology Sydney, Australia
Program Committee
Alex Thomo University of Victoria, Canada
Baoning Niu Taiyuan University of Technology, China
Bo Tang Southern University of Science and Technology, China
Zouhaier Brahmia University of Sfax, Tunisia
Carson Leung University of Manitoba, Canada
Cheng Long Queen’s University Belfast, UK
Chih-Chien Hung Tamkang University, China
Chih-Hua Tai National Taipei University, China
Cuiping Li Renmin University of China, China
Daniele Riboni University of Cagliari, Italy
Defu Lian Big Data Research Center, University of Electronic Science and Technology of China, China
Dejing Dou University of Oregon, USA
Dimitris Sacharidis Technische Universität Wien, Austria
Ganzhao Yuan Sun Yat-sen University, China
Giovanna Guerrini Università di Genova, Italy
Guanfeng Liu The University of Queensland, Australia
Guoqiong Liao Jiangxi University of Finance and Economics, China
Guanling Lee National Dong Hwa University, China
Haibo Hu Hong Kong Polytechnic University, SAR China
Hailong Sun Beihang University, China
Han Su University of Southern California, USA
Haoran Xie The Education University of Hong Kong, SAR China
Hiroaki Ohshima University of Hyogo, Japan
Hong Chen Renmin University of China, China
Hongyan Liu Tsinghua University, China
Hongzhi Wang Harbin Institute of Technology, China
Hongzhi Yin The University of Queensland, Australia
Hua Wang Victoria University, Australia
Ilaria Bartolini University of Bologna, Italy
James Cheng Chinese University of Hong Kong, SAR China
Jeffrey Xu Yu Chinese University of Hong Kong, SAR China
Jiajun Liu Renmin University of China, China
Jialong Han Nanyang Technological University, Singapore
Jianbin Huang Xidian University, China
Jian Yin Sun Yat-sen University, China
Jiannan Wang Simon Fraser University, Canada
Jianting Zhang City College of New York, USA
Jianxin Li Beihang University, China
Jianzhong Qi University of Melbourne, Australia
Jinchuan Chen Renmin University of China, China
Junhu Wang Griffith University, Australia
Kai Zheng University of Electronic Science and Technology of China, China
Karine Zeitouni Université de Versailles Saint-Quentin, France
Leong Hou U University of Macau, SAR China
Lianghuai Yang Zhejiang University of Technology, China
Lisi Chen Wollongong University, Australia
Maria Damiani University of Milan, Italy
Markus Endres University of Augsburg, Germany
Mihai Lupu Vienna University of Technology, Austria
Mirco Nanni ISTI-CNR Pisa, Italy
Mizuho Iwaihara Waseda University, Japan
Peiquan Jin University of Science and Technology of China, China
Qin Lu University of Technology Sydney, Australia
Ralf Hartmut Güting Fernuniversität in Hagen, Germany
Raymond Chi-Wing Wong Hong Kong University of Science and Technology, SAR China
Ronghua Li Shenzhen University, China
Rui Zhang University of Melbourne, Australia
Sanghyun Park Yonsei University, South Korea
Sanjay Madria Missouri University of Science and Technology, USA
Shaoxu Song Tsinghua University, China
Shengli Wu Jiangsu University, China
Shimin Chen Chinese Academy of Sciences, China
Shuo Shang King Abdullah University of Science and Technology, Saudi Arabia
Takahiro Hara Osaka University, Japan
Tieyun Qian Wuhan University, China
Tingjian Ge University of Massachusetts, Lowell, USA
Tom Z. J. Fu Advanced Digital Sciences Center, Singapore
Tru Cao Ho Chi Minh City University of Technology, Vietnam
Vincent Oria New Jersey Institute of Technology, USA
Wee Ng Institute for Infocomm Research, Singapore
Wei Wang University of New South Wales, Australia
Weining Qian East China Normal University, China
Weiwei Sun Fudan University, China
Wolf-Tilo Balke Technische Universität Braunschweig, Germany
Wookey Lee Inha University, South Korea
Xiang Zhao National University of Defence Technology, China
Xiang Lian Kent State University, USA
Xiangliang Zhang King Abdullah University of Science and Technology, Saudi Arabia
Xiangmin Zhou RMIT University, Australia
Xiaochun Yang Northeast University, China
Xiaofeng He East China Normal University, China
Xiaohui (Daniel) Tao The University of Southern Queensland, Australia
Xiaoyong Du Renmin University of China, China
Xike Xie University of Science and Technology of China, China
Xin Cao The University of New South Wales, Australia
Xin Huang Hong Kong Baptist University, SAR China
Xingquan Zhu Florida Atlantic University, USA
Xuan Zhou Renmin University of China, China
Yanghua Xiao Fudan University, China
Yanghui Rao Sun Yat-sen University, China
Yang-Sae Moon Kangwon National University, South Korea
Yaokai Feng Kyushu University, Japan
Yi Cai South China University of Technology, China
Yijie Wang National University of Defense Technology, China
Yingxia Shao Peking University, China
Yongxin Tong Beihang University, China
Yuan Fang Institute for Infocomm Research, Singapore
Yunjun Gao Zhejiang University, China
Zakaria Maamar Zayed University, United Arab Emirates
Zhaonian Zou Harbin Institute of Technology, China
Zhiwei Zhang Hong Kong Baptist University, SAR China
Keynotes
Graph Processing: Applications, Challenges, and Advances

Xuemin Lin
School of Computer Science and Engineering, University of New South Wales, Sydney
lxue@cse.unsw.edu.au

Abstract. Graph data are key parts of Big Data and widely used for modelling complex structured data with a broad spectrum of applications. Over the last decade, tremendous research efforts have been devoted to many fundamental problems in managing and analyzing graph data. In this talk, I will cover various applications, challenges, and recent advances. We will also look to the future of the area.
Differential Privacy in the Local Setting

Ninghui Li
Purdue University, USA

Abstract. … of the art of LDP. We survey recent developments for LDP, and discuss protocols for estimating frequencies of different values under LDP, and for computing marginals when each user has multiple attributes. Finally, we discuss limitations and open problems of LDP.
Big Data, AI, and HI, What is the Next?

Lei Chen
Department of Computer Science and Engineering, Hong Kong University of Science and Technology
leichen@cse.ust.hk

Abstract. Recently, AI has become quite popular and attractive, not only to academia but also to industry. The successful stories of AI on AlphaGo and Texas hold 'em games raise significant public interest in AI. Meanwhile, human intelligence is turning out to be more sophisticated, and Big Data technology is everywhere to improve our life quality. The question we all want to ask is "what is the next?" In this talk, I will discuss DHA, a new computing paradigm, which combines big Data, Human intelligence, and AI. First, I will briefly explain the motivation of DHA. Then I will present some challenges and possible solutions to build this new paradigm.
Similarity Calculations of Academic Articles Using Topic Events and Domain Knowledge  45
Ming Liu, Bo Lang, and Zepeng Gu

Sentiment Classification via Supplementary Information Modeling  54
Zenan Xu, Yetao Fu, Xingming Chen, Yanghui Rao, Haoran Xie, Fu Lee Wang, and Yang Peng

Training Set Similarity Based Parameter Selection for Statistical Machine Translation  63
Xuewen Shi, Heyan Huang, Ping Jian, and Yi-Kun Tang

Matrix Factorization Meets Social Network Embedding for Rating Prediction  121
Menghao Zhang, Binbin Hu, Chuan Shi, Bin Wu, and Bai Wang

An Estimation Framework of Node Contribution Based on Diffusion Information  130
Zhijian Zhang, Ling Liu, Kun Yue, and Weiyi Liu

Multivariate Time Series Clustering via Multi-relational Community Detection in Networks  138
Guowang Du, Lihua Zhou, Lizhen Wang, and Hongmei Chen
Recommender Systems
NSPD: An N-stage Purchase Decision Model for E-commerce Recommendation  149
Cairong Yan, Yan Huang, Qinglong Zhang, and Yan Wan

Social Image Recommendation Based on Path Relevance  165
Zhang Chuanyan, Hong Xiaoguang, and Peng Zhaohui

Representation Learning with Depth and Breadth for Recommendation Using Multi-view Data  181
Xiaotian Han, Chuan Shi, Lei Zheng, Philip S. Yu, Jianxin Li,

UIContextListRank: A Listwise Recommendation Model with Social Contextual Information  207
Zhenhua Huang, Chang Yu, Jiujun Cheng, and Zhixiao Wang

… and Gang Wang

LIDH: An Efficient Filtering Method for Approximate k Nearest Neighbor Queries Based on Local Intrinsic Dimension  268
Yang Song, Yu Gu, and Ge Yu

Query Performance Prediction and Classification for Information Search Systems  277
Zhongmin Zhang, Jiawei Chen, and Shengli Wu
Aggregate Query Processing on Incomplete Data  286
Anzhen Zhang, Jinbao Wang, Jianzhong Li, and Hong Gao
Machine Learning
Travel Time Forecasting with Combination of Spatial-Temporal and Time Shifting Correlation in CNN-LSTM Neural Network  297
Wenjing Wei, Xiaoyi Jia, Yang Liu, and Xiaohui Yu

DMDP2: A Dynamic Multi-source Based Default Probability Prediction Framework  312
Yi Zhao, Yong Huang, and Yanyan Shen

Brain Disease Diagnosis Using Deep Learning Features from Longitudinal MR Images  327
Linlin Gao, Haiwei Pan, Fujun Liu, Xiaoqin Xie, Zhiqiang Zhang, Jinming Han, and the Alzheimer's Disease Neuroimaging Initiative

Attention-Based Recurrent Neural Network for Sequence Labeling  340
Bofang Li, Tao Liu, Zhe Zhao, and Xiaoyong Du

Haze Forecasting via Deep LSTM  349
Fan Feng, Jikai Wu, Wei Sun, Yushuang Wu, HuaKang Li, and Xingguo Chen

Importance-Weighted Distance Aware Stocks Trend Prediction  357
Zherong Zhang, Wenge Rong, Yuanxin Ouyang, and Zhang Xiong
Knowledge Graph
Jointly Modeling Structural and Textual Representation for Knowledge Graph Completion in Zero-Shot Scenario  369
Jianhui Ding, Shiheng Ma, Weijia Jia, and Minyi Guo

Neural Typing Entities in Chinese-Pedia  385
Yongjian You, Shaohua Zhang, Jiong Lou, Xinsong Zhang, and Weijia Jia

Knowledge Graph Embedding by Learning to Connect Entity with Relation  400
Zichao Huang, Bo Li, and Jian Yin

StarMR: An Efficient Star-Decomposition Based Query Processor for SPARQL Basic Graph Patterns Using MapReduce  415
Qiang Xu, Xin Wang, Jianxin Li, Ying Gan, Lele Chai, and Junhu Wang

DAVE: Extracting Domain Attributes and Values from Text Corpus  431
Yongxin Shen, Zhixu Li, Wenling Zhang, An Liu, and Xiaofang Zhou
PRSPR: An Adaptive Framework for Massive RDF Stream Reasoning  440
Guozheng Rao, Bo Zhao, Xiaowang Zhang, Zhiyong Feng, and Guohui Xiao
Demo Papers
TSRS: Trip Service Recommended System Based on Summarized Co-location Patterns  451
Peizhong Yang, Tao Zhang, and Lizhen Wang

DFCPM: A Dominant Feature Co-location Pattern Miner  456
Yuan Fang, Lizhen Wang, Teng Hu, and Xiaoxuan Wang

CUTE: Querying Knowledge Graphs by Tabular Examples  461
Zichen Wang, Tian Li, Yingxia Shao, and Bin Cui

ALTAS: An Intelligent Text Analysis System Based on Knowledge Graphs  466
Xiaoli Wang, Chuchu Gao, Jiangjiang Cao, Kunhui Lin, Wenyuan Du, and Zixiang Yang

SPARQLVis: An Interactive Visualization Tool for Knowledge Graphs  471
Chaozhou Yang, Xin Wang, Qiang Xu, and Weixi Li

PBR: A Personalized Book Resource Recommendation System  475
Yajie Zhu, Feng Xiong, Qing Xie, Lin Li, and Yongjian Liu
Author Index 481
Contents – Part II
Database and Web Applications
Fuzzy Searching Encryption with Complex Wild-Cards Queries on Encrypted Database  3
He Chen, Xiuxia Tian, and Cheqing Jin

Towards Privacy-Preserving Travel-Time-First Task Assignment in Spatial Crowdsourcing  19
Jian Li, An Liu, Weiqi Wang, Zhixu Li, Guanfeng Liu, Lei Zhao, and Kai Zheng

Plover: Parallel In-Memory Database Logging on Scalable Storage Devices  35
Huan Zhou, Jinwei Guo, Ouya Pei, Weining Qian, Xuan Zhou, and Aoying Zhou

Inferring Regular Expressions with Interleaving from XML Data  44
Xiaolan Zhang, Yeting Li, Fei Tian, Fanlin Cui, Chunmei Dong, and Haiming Chen

Efficient Query Reverse Engineering for Joins and OLAP-Style Aggregations  53
Wei Chit Tan

DCA: The Advanced Privacy-Enhancing Schemes for Location-Based Services  63
Jiaxun Hua, Yu Liu, Yibin Shen, Xiuxia Tian, and Cheqing Jin
Data Streams
Discussion on Fast and Accurate Sketches for Skewed Data Streams: A Case Study  75
Shuhao Sun and Dagang Li

Matching Consecutive Subpatterns over Streaming Time Series  90
Rong Kang, Chen Wang, Peng Wang, Yuting Ding, and Jianmin Wang

A Data Services Composition Approach for Continuous Query on Data Streams  106
Guiling Wang, Xiaojiang Zuo, Marc Hesenius, Yao Xu, Yanbo Han, and Volker Gruhn

Discovering Multiple Time Lags of Temporal Dependencies from Fluctuating Events  121
Wentao Wang, Chunqiu Zeng, and Tao Li

A Combined Model for Time Series Prediction in Financial Markets  138
Hongbo Sun, Chenkai Guo, Jing Xu, Jingwen Zhu, and Chao Zhang
Data Mining and Application
Location Prediction in Social Networks  151
Rong Liu, Guanglin Cong, Bolong Zheng, Kai Zheng, and Han Su

Efficient Longest Streak Discovery in Multidimensional Sequence Data  166
Wentao Wang, Bo Tang, and Min Zhu

Map Matching Algorithms: An Experimental Evaluation  182
Na Ta, Jiuqi Wang, and Guoliang Li

Predicting Passenger's Public Transportation Travel Route Using Smart Card Data  199
Chen Yang, Wei Chen, Bolong Zheng, Tieke He, Kai Zheng, and Han Su

Detecting Taxi Speeding from Sparse and Low-Sampled Trajectory Data  214
Xibo Zhou, Qiong Luo, Dian Zhang, and Lionel M. Ni

Cloned Vehicle Behavior Analysis Framework  223
Minxi Li, Jiali Mao, Xiaodong Qi, Peisen Yuan, and Cheqing Jin

An Event Correlation Based Approach to Predictive Maintenance  232
Meiling Zhu, Chen Liu, and Yanbo Han

Using Crowdsourcing for Fine-Grained Entity Type Completion in Knowledge Bases  248
Zhaoan Dong, Ju Fan, Jiaheng Lu, Xiaoyong Du, and Tok Wang Ling

Improving Clinical Named Entity Recognition with Global Neural Attention  264
Guohai Xu, Chengyu Wang, and Xiaofeng He

Exploiting Implicit Social Relationship for Point-of-Interest Recommendation  280
Haifeng Zhu, Pengpeng Zhao, Zhixu Li, Jiajie Xu, Lei Zhao, and Victor S. Sheng

Spatial Co-location Pattern Mining Based on Density Peaks Clustering and Fuzzy Theory  298
Yuan Fang, Lizhen Wang, and Teng Hu
A Tensor-Based Method for Geosensor Data Forecasting  306
Lihua Zhou, Guowang Du, Qing Xiao, and Lizhen Wang

Keyphrase Extraction Based on Optimized Random Walks on Multiple Word Relations  359
Wenyan Chen, Zheng Liu, Wei Shi, and Jeffrey Xu Yu

Answering Range-Based Reverse kNN Queries  368
Zhefan Zhong, Xin Lin, Liang He, and Yan Yang
Big Data and Blockchain
EarnCache: Self-adaptive Incremental Caching for Big Data Applications  379
Yifeng Luo, Junshi Guo, and Shuigeng Zhou

Storage and Recreation Trade-Off for Multi-version Data Management  394
Yin Zhang, Huiping Liu, Cheqing Jin, and Ye Guo

Decentralized Data Integrity Verification Model in Untrusted Environment  410
Kun Hao, Junchang Xin, Zhiqiong Wang, Zhuochen Jiang, and Guoren Wang

Enabling Concurrency on Smart Contracts Using Multiversion Ordering  425
An Zhang and Kunlong Zhang

ElasticChain: Support Very Large Blockchain by Reducing Data Redundancy  440
Dayu Jia, Junchang Xin, Zhiqiong Wang, Wei Guo, and Guoren Wang

A MapReduce-Based Approach for Mining Embedded Patterns from Large Tree Data  455
Wen Zhao and Xiaoying Wu
Author Index 463
Text Analysis
Abstractive Summarization with the Aid of Extractive Summarization
Yangbin Chen(B), Yun Ma, Xudong Mao, and Qing Li
City University of Hong Kong, Hong Kong SAR, China
{robinchen2-c,yunma3-c,xdmao2-c}@my.cityu.edu.hk,
qing.li@cityu.edu.hk
Abstract. Currently the abstractive method and the extractive method are the two main approaches for automatic document summarization. To fully integrate the relatedness and advantages of both approaches, we propose in this paper a general framework for abstractive summarization which incorporates extractive summarization as an auxiliary task. In particular, our framework is composed of a shared hierarchical document encoder, an attention-based decoder for abstractive summarization, and an extractor for sentence-level extractive summarization. Learning these two tasks jointly with the shared encoder allows us to better capture the semantics in the document. Moreover, we constrain the attention learned in the abstractive task by the salience estimated in the extractive task to strengthen their consistency. Experiments on the CNN/DailyMail dataset demonstrate that both the auxiliary task and the attention constraint contribute to improving the performance significantly, and our model is comparable to the state-of-the-art abstractive models.
Keywords: Abstractive document summarization · Sequence-to-sequence · Joint learning
1 Introduction
Automatic document summarization has been studied for decades. The target of document summarization is to generate a shorter passage from the document in a grammatically and logically coherent way, meanwhile preserving the important information. There are two main approaches for document summarization: extractive summarization and abstractive summarization. The extractive method first extracts salient sentences or phrases from the source document and then groups them to produce a summary without changing the source text. Graph-based ranking models [1,2] and feature-based classification models [3,4] are typical models for extractive summarization. However, the extractive method unavoidably includes secondary or redundant information and is far from the way humans write summaries [5].
The abstractive method, in contrast, produces generalized summaries, conveying information in a concise way, and eliminating the limitations to the original words and sentences of the document. This task is more challenging since it needs advanced language generation and compression techniques. Discourse structures [6,7] and semantics [8,9] are most commonly used by researchers for generating abstractive summaries.

Recently, the Recurrent Neural Network (RNN)-based sequence-to-sequence model with attention mechanism has been applied to abstractive summarization, due to its great success in machine translation [22,27,30]. However, there are still some challenges. First, the RNN-based models have difficulties in capturing long-term dependencies, making summarization for long documents much tougher. Second, different from machine translation, which has strong correspondence between the source and target words, an abstractive summary corresponds to only a small part of the source document, making its attention difficult to be learned.

We adopt hierarchical approaches for the long-term dependency problem, which have been used in many tasks such as machine translation and document classification [10,11]. But few of them have been applied to abstractive summarization tasks. In particular, we encode the input document in a hierarchical way from word level to sentence level. There are two advantages. First, it captures both the local and global semantic representations, resulting in better feature learning. Second, it improves the training efficiency because the time complexity of the RNN-based model can be reduced by splitting the long document into short sentences.
The attention mechanism is widely used in sequence-to-sequence tasks [13,27]. However, for abstractive summarization, it is difficult to learn the attention since only a small part of the source document is important to the summary. In this paper, we propose two methods to learn a better attention distribution. First, we use a hierarchical attention mechanism, which means that the attention is applied at both word and sentence levels. Similar to the hierarchical approach in encoding, the advantage of using hierarchical attention is to capture both the local and global semantic representations. Second, we use the salience scores of the auxiliary task (i.e., the extractive summarization) to constrain the sentence-level attention.
In this paper, we present a novel technique for abstractive summarization which incorporates extractive summarization as an auxiliary task. Our framework consists of three parts: a shared document encoder, a hierarchical attention-based decoder, and an extractor. As Fig. 1 shows, we encode the document in a hierarchical way (Fig. 1 (1) and (2)) in order to address the long-term dependency problem. Then the learned document representations are shared by the extractor (Fig. 1 (3)) and the hierarchical attention-based decoder (Fig. 1 (5)). The extractor and the decoder are jointly trained, which can capture better semantics of the document. Furthermore, as both the sentence salience scores in the extractor and the sentence-level attention in the decoder indicate the importance of source sentences, we constrain the learned attention (Fig. 1 (4)) with the extracted sentence salience in order to strengthen their consistency.

Fig. 1. General framework of our proposed model with 5 components: (1) word-level encoder encodes the sentences word-by-word independently, (2) sentence-level encoder encodes the document sentence-by-sentence, (3) sentence extractor makes binary classification for each sentence, (4) hierarchical attention calculates the word-level and sentence-level context vectors for decoding steps, (5) decoder decodes the output word sequence sequentially with a beam-search algorithm.
We have conducted experiments on a news corpus, the CNN/DailyMail dataset [16]. The results demonstrate that adding the auxiliary extractive task and constraining the attention are both useful to improve the performance of the abstractive task, and our proposed joint model is comparable to the state-of-the-art abstractive models.
2 Neural Summarization Model
In this section we describe the framework of our proposed model, which consists of five components. As illustrated in Fig. 1, the hierarchical document encoder, which includes both the word-level and the sentence-level encoders, reads the input word sequences and generates shared document representations. On one hand, the shared representations are fed into the sentence extractor, which is a sequence labeling model, to calculate salience scores. On the other hand, the representations are used to generate abstractive summaries by a GRU-based language model, with the hierarchical attention including the sentence-level attention and word-level attention. Finally, the two tasks are jointly trained.
We encode the document in a hierarchical way. In particular, the word sequences are first encoded by a bidirectional GRU network in parallel, and a sequence of sentence-level vector representations called sentence embeddings is generated. Then the sentence embeddings are fed into another bidirectional GRU network to get the document representations. Such an architecture has two advantages. First, it can reduce the negative effects during the training process caused by the long-term dependency problem, so that the document can be represented from both local and global aspects. Second, it helps improve the training efficiency, as the time complexity of an RNN-based model increases with the sequence length.
Formally, let $V$ denote the vocabulary which contains $D$ tokens, and each token is embedded as a $d$-dimensional vector. Given an input document $X$ containing $m$ sentences $\{X_i, i \in 1, \ldots, m\}$, let $n_i$ denote the number of words in $X_i$.
Word-level Encoder reads a sentence word-by-word until the end, using a bidirectional GRU network as in the following equations:

$$\overrightarrow{h}^w_{i,j} = \mathrm{GRU}(x_{i,j}, \overrightarrow{h}^w_{i,j-1}), \qquad \overleftarrow{h}^w_{i,j} = \mathrm{GRU}(x_{i,j}, \overleftarrow{h}^w_{i,j+1})$$

where $x_{i,j}$ represents the embedding vector of the $j$th word in the $i$th sentence. $h^w_{i,j}$ is a concatenated vector of the forward hidden state $\overrightarrow{h}^w_{i,j}$ and the backward hidden state $\overleftarrow{h}^w_{i,j}$. $H$ is the size of the hidden state.
Furthermore, the $i$th sentence is represented by a non-linear transformation of the word-level hidden states, where $s_i$ is the sentence embedding and $W$, $b$ are learnable parameters.
Sentence-level Encoder reads a document sentence-by-sentence until the end, using another bidirectional GRU network:

$$\overrightarrow{h}^s_i = \mathrm{GRU}(s_i, \overrightarrow{h}^s_{i-1}), \qquad \overleftarrow{h}^s_i = \mathrm{GRU}(s_i, \overleftarrow{h}^s_{i+1})$$

The concatenated vectors $h^s_i = [\overrightarrow{h}^s_i; \overleftarrow{h}^s_i]$ are the document representations shared by the two tasks, which will be introduced next.
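The following is a minimal PyTorch sketch of this hierarchical encoder (the paper's own implementation is in TensorFlow). The exact non-linear pooling from word states to the sentence embedding $s_i$ is not reproduced in this text, so mean pooling followed by a tanh layer is an assumption; dimensions follow the hyper-parameters reported later (embedding size 300, hidden size 200).

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Word-level BiGRU -> sentence embeddings -> sentence-level BiGRU."""

    def __init__(self, vocab_size=50000, emb_dim=300, hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_gru = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.sent_gru = nn.GRU(2 * hidden, hidden, bidirectional=True, batch_first=True)
        # Assumed pooling: mean over word states, then a non-linear projection to s_i
        self.pool = nn.Sequential(nn.Linear(2 * hidden, 2 * hidden), nn.Tanh())

    def forward(self, doc):
        # doc: (num_sentences, max_words) word ids of a single document
        word_states, _ = self.word_gru(self.embed(doc))        # (m, n_i, 2H), h^w_{i,j}
        sent_emb = self.pool(word_states.mean(dim=1))          # (m, 2H), s_i
        sent_states, _ = self.sent_gru(sent_emb.unsqueeze(0))  # (1, m, 2H), h^s_i
        return word_states, sent_states.squeeze(0)

# Example: a document with 3 sentences of up to 5 words each.
encoder = HierarchicalEncoder()
doc = torch.randint(0, 50000, (3, 5))
word_states, sent_states = encoder(doc)
print(word_states.shape, sent_states.shape)  # torch.Size([3, 5, 400]) torch.Size([3, 400])
```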
The sentence extractor can be viewed as a sequential binary classifier. We use a logistic function to calculate a score between 0 and 1, which is an indicator of whether or not to keep the sentence in the final summary. The score can also be considered as the salience of a sentence in the document. Let $p_i$ denote the score and $q_i \in \{0, 1\}$ denote the result of whether or not to keep the sentence. In particular, $p_i$ is calculated as

$$p_i = \sigma(W_{extr} h^s_i + b_{extr})$$

where $W_{extr}$ is the weight and $b_{extr}$ is the bias, which can be learned.
The sentence extractor generates a sequence of probabilities indicating the importance of the sentences. As a result, the extractive summary is created by selecting sentences with a probability larger than a given threshold $\tau$. We set $\tau = 0.5$ in our experiment. We choose the cross entropy as the extractive loss.
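A minimal sketch of this extractor head: a logistic score per sentence, thresholding at τ = 0.5 for the extractive summary, and a cross-entropy loss against the binary labels. The layer shape is an assumption based on the shared sentence representations above.

```python
import torch
import torch.nn as nn

class SentenceExtractor(nn.Module):
    """Scores each sentence representation h^s_i with a logistic unit."""

    def __init__(self, sent_dim=400):
        super().__init__()
        self.linear = nn.Linear(sent_dim, 1)   # W_extr, b_extr

    def forward(self, sent_states):
        # sent_states: (m, sent_dim) shared document representations
        return torch.sigmoid(self.linear(sent_states)).squeeze(-1)  # p_i in (0, 1)

extractor = SentenceExtractor()
sent_states = torch.randn(6, 400)                # 6 sentences
labels = torch.tensor([1., 0., 1., 0., 0., 0.])  # q_i: keep sentence or not
scores = extractor(sent_states)

extract_loss = nn.functional.binary_cross_entropy(scores, labels)  # extractive cross-entropy loss
summary_ids = (scores > 0.5).nonzero(as_tuple=True)[0]             # sentences kept with tau = 0.5
print(extract_loss.item(), summary_ids.tolist())
```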
… $c^s_t$ is the sentence-level context vector and $c^w_t$ is the word-level context vector at decoding time step $t$. Specifically, $\alpha_{t,i}$ denotes the attention value on the $i$th sentence and $\beta_{t,i,j}$ denotes the attention value on the $j$th word of the $i$th sentence.
The input of the GRU-based language model at decoding time step $t$ contains three vectors: the word embedding of the previously generated word $\hat{y}_{t-1}$, the sentence-level context vector of the previous time step $c^s_{t-1}$, and the word-level context vector of the previous time step $c^w_{t-1}$. They are transformed by a linear function and fed into the language model as follows:
$$\tilde{h}_t = \mathrm{GRU}(\tilde{h}_{t-1}, f_{in}(\hat{y}_{t-1}, c^s_{t-1}, c^w_{t-1}))$$
where $\tilde{h}_t$ is the hidden state of decoding time step $t$, and $f_{in}$ is the linear transformation function with $W_{dec}$ as the weight and $b_{dec}$ as the bias.
The hidden states of the language model are used to generate the output word sequence. The conditional probability distribution over the vocabulary at the $t$th time step is

$$P(\hat{y}_t \mid \hat{y}_1, \ldots, \hat{y}_{t-1}, x) = g(f_{out}(\tilde{h}_t, c^s_t, c^w_t))$$

where $g$ is the softmax function and $f_{out}$ is a linear function with $W_{soft}$ and $b_{soft}$ as learnable parameters.
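The sketch below shows one decoding step as described above, assuming concrete layer shapes for $f_{in}$ and $f_{out}$; it is illustrative only and not the authors' TensorFlow code.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, vocab_size=50000, emb_dim=300, hidden=200, ctx_dim=400):
        super().__init__()
        self.f_in = nn.Linear(emb_dim + 2 * ctx_dim, hidden)      # merges y_{t-1}, c^s_{t-1}, c^w_{t-1}
        self.cell = nn.GRUCell(hidden, hidden)
        self.f_out = nn.Linear(hidden + 2 * ctx_dim, vocab_size)  # uses h~_t, c^s_t, c^w_t

    def forward(self, h_prev, y_prev_emb, cs_prev, cw_prev, cs_t, cw_t):
        x = self.f_in(torch.cat([y_prev_emb, cs_prev, cw_prev], dim=-1))
        h_t = self.cell(x, h_prev)                                 # h~_t
        logits = self.f_out(torch.cat([h_t, cs_t, cw_t], dim=-1))
        return h_t, torch.log_softmax(logits, dim=-1)              # log P(y_t | y_<t, x)

step = DecoderStep()
h = torch.zeros(1, 200)
y_emb, cs, cw = torch.randn(1, 300), torch.randn(1, 400), torch.randn(1, 400)
h, log_probs = step(h, y_emb, cs, cw, cs, cw)
print(log_probs.shape)  # torch.Size([1, 50000])
```

The decoder's negative log-likelihood loss is then simply the negated log probability of each gold summary word at each step.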
The negative log likelihood loss is applied as the loss of the decoder.

The sentence-level attention is normalized over the sentences of the document:

$$\alpha_{t,i} = \frac{e^s_{t,i}}{\sum_{k=1}^{m} e^s_{t,k}}$$

where $e^s_{t,i}$ is an attention score computed from the decoder state and the sentence representation $h^s_i$, and $V_s$, $W^{dec}_1$, $W^s_1$ and $b^s_1$ are learnable parameters.
The word-level attention indicates the salience distribution over the source words. As the hierarchical encoder reads the input sentences independently, our model has two distinctions. First, the word-level attention is calculated within a sentence. Second, we multiply the word-level attention by the sentence-level attention of the sentence which the word belongs to. The word-level attention calculation is shown below:

$$\beta_{t,i,j} = \alpha_{t,i} \cdot \frac{e^w_{t,i,j}}{\sum_{l=1}^{n_i} e^w_{t,i,l}}$$

where $V_w$, $W^{dec}_2$, $W^w_2$ and $b^w_2$ are learnable parameters for the word-level attention calculation.
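A small NumPy sketch of this two-level attention: α is a softmax over sentence scores, word scores are normalized within their sentence, and β is the product of the two. Random scores stand in for the learned terms $e^s$ and $e^w$, whose exact scoring functions are not reproduced here.

```python
import numpy as np

def hierarchical_attention(sent_scores, word_scores):
    """sent_scores: (m,) raw e^s_{t,i}; word_scores: (m, n) raw e^w_{t,i,j}."""
    # Sentence-level attention: softmax over the m sentences
    alpha = np.exp(sent_scores) / np.exp(sent_scores).sum()
    # Word-level attention: softmax within each sentence, scaled by its sentence's alpha
    word_exp = np.exp(word_scores)
    beta = alpha[:, None] * word_exp / word_exp.sum(axis=1, keepdims=True)
    return alpha, beta

rng = np.random.default_rng(0)
alpha, beta = hierarchical_attention(rng.normal(size=4), rng.normal(size=(4, 6)))
print(alpha.sum(), beta.sum())  # both ~1.0: beta is a distribution over all source words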
The abstractive summary of a long document can be viewed as a new expression of the most salient sentences of the document, so that a well-learned sentence extractor and a well-learned attention distribution should both be able to detect the important sentences of the source document. Motivated by this, we design a constraint on the sentence-level attention, which is an L2 loss between the attention and the extracted sentence salience.

The parameters are trained to minimize the joint loss function. In the inference stage, we use the beam search algorithm to select the word which approximately maximizes the conditional probability [17,18,28].
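To make the joint training concrete, the sketch below combines the decoder loss, the extractive loss, and an L2 constraint between the sentence-level attention and the extractor's salience scores. The weighting scheme uses λ and γ as in the hyper-parameters reported later (λ = 100, γ = 0.5), but the exact combination formula is an assumption, since the paper's joint loss equation is not reproduced in this text.

```python
import torch

def joint_loss(dec_nll, extract_loss, alpha, salience, lam=100.0, gamma=0.5):
    """
    dec_nll:      negative log-likelihood loss of the abstractive decoder
    extract_loss: cross-entropy loss of the sentence extractor
    alpha:        (T, m) sentence-level attention at each decoding step
    salience:     (m,) extractor scores p_i, detached so they act as a target
    """
    # L2 constraint pulling the (averaged) sentence attention toward the salience scores
    attn_constraint = ((alpha.mean(dim=0) - salience.detach()) ** 2).sum()
    return dec_nll + gamma * extract_loss + lam * attn_constraint

loss = joint_loss(torch.tensor(3.2), torch.tensor(0.7),
                  torch.softmax(torch.randn(10, 6), dim=-1),
                  torch.sigmoid(torch.randn(6)))
print(loss.item())
```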
Table 1. The statistics of the CNN/DailyMail dataset. S.S.N indicates the average number of sentences in the source document. S.S.L indicates the average length of the sentences in the source document. T.S.L indicates the average length of the sentences in the target summary.

Dataset         Train    Valid   Test    S.S.N  S.S.L  T.S.L
CNN/DailyMail   277,554  13,367  11,443  26.9   27.3   53.8
In our implementation, we set the vocabulary size $D$ to 50K and the word embedding size $d$ to 300. The word embeddings have not been pretrained, as the training corpus is large enough to train them from scratch. We cut off the documents at a maximum of 35 sentences and truncate the sentences to a maximum of 50 words. We also truncate the target summaries to a maximum of 100 words. The word-level encoder and the sentence-level encoder each correspond to a layer of bidirectional GRU, and the decoder is a layer of unidirectional GRU. All three networks have the hidden size $H$ of 200. For the loss function, $\lambda$ is set to 100 and $\gamma$ is set to 0.5. During the training process, we use the Adagrad optimizer [31] with a learning rate of 0.15 and an initial accumulator value of 0.1. The mini-batch size is 16. We implement the model in TensorFlow and train it using a GTX-1080Ti GPU. The beam search size for decoding is 5. We use ROUGE scores [20] to evaluate the summarization models.
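For reference, these hyper-parameters can be collected into a single configuration object, e.g. as below; the field names are arbitrary, only the values are taken from the text above.

```python
from dataclasses import dataclass

@dataclass
class Config:
    vocab_size: int = 50_000     # D
    emb_dim: int = 300           # d, embeddings trained from scratch
    max_sents: int = 35          # document truncated to 35 sentences
    max_words: int = 50          # each sentence truncated to 50 words
    max_summary_len: int = 100   # target summaries truncated to 100 words
    hidden: int = 200            # H, for both encoders and the decoder
    lam: float = 100.0           # lambda in the loss function
    gamma: float = 0.5           # gamma in the loss function
    lr: float = 0.15             # Adagrad learning rate
    init_accum: float = 0.1      # Adagrad initial accumulator value
    batch_size: int = 16
    beam_size: int = 5

print(Config())
```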
4 Experimental Results
We compare the full-length Rouge F1 scores on the entire CNN/DailyMail test set. We use the fundamental sequence-to-sequence attentional model and the words-lvt2k-hieratt model [13] as baselines. The results are shown in Table 2.
Table 2. Performance comparison of various abstractive models on the entire CNN/DailyMail test set using full-length F1 variants of Rouge.

Model                 Rouge-1  Rouge-2  Rouge-L
words-lvt2k-hieratt   35.4     13.3     32.6
From Table 2, we can see that our model performs the best in Rouge-1, Rouge-2 and Rouge-L. Compared to the vanilla sequence-to-sequence attentional model, our proposed model performs considerably better. And compared to the hierarchical model, our model performs better in Rouge-L, which is due to the incorporation of the auxiliary task.
To verify the effectiveness of our proposed model, we conduct an ablation study by removing the corresponding parts, i.e., the auxiliary extractive task, the attention constraint, and the combination of them, in order to make a comparison among their contributions.

Table 3. Performance comparison of removing the components of our proposed model on the entire CNN/DailyMail test set using full-length F1 variants of Rouge.

… a more important role in our framework.
We list some examples of the generated summaries for a source document (news) in Fig. 2. The source document contains 24 sentences with 660 words in total. Figure 2 presents three summaries: a golden summary, which is the news highlight written by the reporter, the summary generated by our proposed model, and the summary generated by the sequence-to-sequence attentional model.
From the figure we can see that all system-generated summaries are copied words from the source document, because the highlights written by reporters used for training are usually partly copied from the source. However, different models have different characteristics. As illustrated in Fig. 2, all the four summaries are able to catch several key sentences from the document. The fundamental seq2seq+attn model misses some words like pronouns, which leads to grammatical mistakes in the generated summary. Our model without the auxiliary extractive task is able to detect more salient content, but the concatenated sentences have some grammatical mistakes and redundant words. Our model without the attention constraint generates fluent sentences which are very similar to the source sentences, but it focuses on just a small part of the source document. The summary generated by our proposed full model is most similar to the golden summary: it covers as much information and keeps correct grammar. It does not just copy sentences but uses segmentation. Moreover, it changes the order of the source sentences while keeping the logical coherence.
Our model has advantages from three aspects. First, summaries generated by our model contain as much important information and perform well grammatically. In practice, it depends on users' preference between the information coverage and condensibility to make a suitable balance. Compared to the low-recall abstractive methods, our model is able to cover more information. And compared to the extractive methods, the generated summaries are more coherent logically. Second, the time complexity of our approach is much less than the baselines due to the hierarchical structures, and our model is trained more quickly compared to those baselines. Third, as our key contribution is to improve the performance of the main task by incorporating an auxiliary task, in this experiment we just use normal GRU-based encoders and decoder for simplicity. More novel designs for the decoder, such as a hierarchical decoder, can also be applied and incorporated into our model.

Fig. 2. An example of summaries for a piece of news. From top to bottom, the first is the source document, which is the raw news content. The second is the golden summary, which is used as the ground truth. The third is the summary generated by our proposed model. The last is the summary generated by the vanilla sequence-to-sequence attentional model.
5 Related Work
The neural attentional abstractive summarization model was first applied in sentence compression [12], where the input sequence is encoded by a convolutional network and the output sequence is decoded by a standard feedforward Neural Network Language Model (NNLM). Chopra et al. [13] and Lopyrev [23] switched to RNN-type models as the encoder, and did experiments on various values of hyper-parameters. To address the out-of-vocabulary problem, Gu et al. [24], Cao et al. [26] and See et al. [25] presented the copy mechanism, which adds a selection operation between the hidden state and the output layer at each decoding time step so as to decide whether to generate a new word from the vocabulary or copy the word directly from the source sentence.

The sequence-to-sequence model with attention mechanism [13] achieves competitive performance for sentence compression, but document summarization is still a challenge for it. Some researchers use a hierarchical encoder to address the long-term dependency problem, yet most of the works are for extractive summarization tasks. Nallapati et al. [29] fed the input word embeddings extended with new features to the word-level bidirectional GRU network and generated sequential labels from the sentence-level representations. Cheng and Lapata [16] presented a sentence extraction and word extraction model, encoding the sentences independently using Convolutional Neural Networks and decoding a binary sequence for sentence extraction as well as a word sequence for word extraction. Nallapati et al. [21] proposed a hierarchical attention with a hierarchical encoder, in which the word-level attention represents a probability distribution over the entire document.

Most previous works consider extractive summarization and abstractive summarization as two independent tasks. The extractive task has the advantage of preserving the original information, and the abstractive task has the advantage of generating coherent sentences. It is thus reasonable and feasible to combine these two tasks. Tan et al. [14], as the first attempt to combine the two, tried to use the extracted sentence scores to calculate the attention for the abstractive decoder. But their proposed model, which uses an unsupervised graph-based model to rank the sentences, is of high computation cost and takes a long time to train.
6 Conclusion
In this work we have presented a sequence-to-sequence model with a hierarchical document encoder and hierarchical attention for abstractive summarization, and incorporated extractive summarization as an auxiliary task. We jointly train the two tasks by sharing the same document encoder. The auxiliary task and the attention constraint contribute to improving the performance of the main task. Experiments on the CNN/DailyMail dataset show that our proposed framework is comparable to the state-of-the-art abstractive models. In the future, we will try to reduce the labels of the auxiliary task and incorporate semi-supervised and unsupervised methods.
Acknowledgements. This research has been supported by an innovative technology fund (project no. GHP/036/17SZ) from the Innovation and Technology Commission of Hong Kong, and a donated research project (project no. 9220089) at City University of Hong Kong.
References

5. Yao, J., Wan, X., Xiao, J.: Recent advances in document summarization. Knowl. Inf. Syst. 53, 297–336 (2017). https://doi.org/10.1007/s10115-017-1042-4
6. Cheung, J.C.K., Penn, G.: Unsupervised sentence enhancement for automatic summarization. In: EMNLP, Doha, pp. 775–786 (2014)
7. Gerani, S., Mehdad, Y., Carenini, G., Ng, R.T., Nejat, B.: Abstractive summarization of product reviews using discourse structure. In: EMNLP, Doha, pp. 1602–1613 (2014)
8. Fang, Y., Zhu, H., Muszynska, E., Kuhnle, A., Teufel, S.H.: A proposition-based abstractive summarizer. In: COLING, Osaka, pp. 567–578 (2016)
9. Liu, F., Flanigan, J., Thomson, S., Sadeh, N., Smith, N.A.: Toward abstractive summarization using semantic representations. In: NAACL-HLT, Denver, pp. 1077–1086 (2015)
10. Li, J., Luong, M.T., Jurafsky, D.: A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057 (2015)
11. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A.J., Hovy, E.H.: Hierarchical attention networks for document classification. In: NAACL-HLT, San Diego, pp. 1480–
22. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
23. Lopyrev, K.: Generating news headlines with recurrent neural networks. arXiv preprint arXiv:1512.01712 (2015)
24. Gu, J., Lu, Z., Li, H., Li, V.O.: Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393 (2016)
25. See, A., Liu, P.J., Manning, C.D.: Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368 (2017)
26. Cao, Z., Luo, C., Li, W., Li, S.: Joint copying and restricted generation for paraphrase. In: AAAI, San Francisco, pp. 3152–3158 (2017)
27. Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015)
28. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS, Montreal, pp. 3104–3112 (2014)
29. Nallapati, R., Zhai, F., Zhou, B.: SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents. In: AAAI, San Francisco, pp. 3075–3081 (2017)
30. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al.: Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)
31. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. JMLR 12, 2121–2159 (2011)
Rank-Integrated Topic Modeling: A General Framework
Zhen Zhang, Ruixuan Li, Yuhua Li(B), and Xiwu Gu
School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
{zenzang,rxli,idcliyuhua,guxiwu}@hust.edu.cn
Abstract. Rank-integrated topic models, which incorporate link structures into topic modeling through topical ranking, have shown promising performance compared to other link-combined topic models. However, existing work on rank-integrated topic modeling treats ranking as the document distribution for a topic, and therefore cannot integrate topical ranking with the LDA model, which is one of the most popular topic models. In this paper, we introduce a new method to integrate topical ranking with topic modeling and propose a general framework for topic modeling of documents with link structures. By interpreting the normalized topical ranking score vectors as topic distributions for documents, we fuse ranking into topic modeling in a general framework. Under this general framework, we construct two rank-integrated PLSA models and two rank-integrated LDA models, and present the corresponding learning algorithms. We apply our models on four real datasets and compare them with baseline topic models and the state-of-the-art link-combined topic models in generalization performance, document classification, document clustering and topic interpretability. Experiments show that all rank-integrated topic models perform better than baseline models, and rank-integrated LDA models outperform all the compared models.
Keywords: Normalized topical ranking · Topic distribution · Rank-integrated topic modeling framework
1 Introduction
With the rapid development of online information systems, document networks, i.e., information networks associated with text information, are becoming pervasive in our digital library. For example, research papers are linked together via citations, web pages are connected by hyperlinks, and Tweets are connected via social relationships. To better mine values from documents with link structures, we study the problem of building topic models of document networks.

The most popular topic models include PLSA (Probabilistic Latent Semantic Analysis) [1] and LDA (Latent Dirichlet Allocation) [2]. Traditional topic models assume documents are independent of each other, and links among them are not considered in the modeling process.
Intuitively, linked documents should have similar semantic information, which can be utilized in topic modeling.

To take advantage of link structures in document networks, several topic models have been proposed. One line of this work is to build unified generative models for both texts and links, such as iTopic [3] and RTM [4], and the other line is to add regularization into topic modeling, such as a graph-based regularizer [5] or a rank-based regularizer [6]. As a state-of-the-art link-combined topic model, LIMTopic [7] incorporates link structures into topic modeling through topical ranking. However, LIMTopic treats topical ranking as the document distribution for a topic, which means that topical ranking can only be combined with the symmetric PLSA model. Therefore LIMTopic cannot be combined with the popular LDA model. To solve this problem, we normalize topical ranking vectors along the topic dimension and treat them as topic distributions for documents. Link structures are then fused with text information by iteratively performing topical ranking and topic modeling in a mutually enhanced framework.
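A minimal sketch of this core idea, assuming the topical ranking scores are available as a non-negative document-by-topic matrix: normalizing each row along the topic dimension turns a document's ranking vector γ_d into a topic distribution θ_d that can be fed back into the topic model. This is an illustrative NumPy sketch, not the authors' code.

```python
import numpy as np

def ranking_to_topic_distribution(gamma):
    """gamma: (D, K) non-negative topical ranking scores gamma_d over K topics."""
    # Normalize each document's ranking vector along the topic dimension so that
    # it can be interpreted as theta_d, a topic distribution for document d.
    return gamma / gamma.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
gamma = rng.random((5, 3))            # ranking scores for 5 documents, 3 topics
theta = ranking_to_topic_distribution(gamma)
print(theta.sum(axis=1))              # each row sums to 1
```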
In this paper, we propose a general framework for rank-integrated topic modeling, which can be integrated with both PLSA and LDA models. The main contributions of this paper are summarized as follows.

– A novel approach to integrate topical ranking with topic modeling is proposed, upon which we build a general rank-integrated topic modeling framework for document networks.
– Under this general framework, we construct two rank-integrated PLSA models, namely RankPLSA and HITSPLSA, and two rank-integrated LDA models, i.e., RankLDA and HITSLDA.
– Extensive experiments on three publication datasets and one Twitter dataset demonstrate that rank-integrated topic models perform better than baseline models. Moreover, rank-integrated LDA models consistently perform better than all the compared models.
2 Related Work
Topic modeling algorithms are unsupervised machine learning methods that lyze words of documents to discover themes that run through the corpus anddistributions on these themes for each document PLSA [1] and LDA [2] are twomost well known topic models However, both PLSA and LDA treat documents
ana-in a given corpus as ana-independent to each other Sana-ince their presence, variouskinds of models have been proposed by incorporating contextual informationinto topic modeling, such as time [8] and links [3,4,7,9] Several recent worksintroduce embeddings into topic modeling to improve topic interpretability [10]
Trang 4018 Z Zhang et al.
or reduce computation complexity [11] To better cope with word sparsity, manyshort text-based topic models have been proposed [12] Topic models have alsobeen explored in other research domains, such as recommender system [13] Themost similar work to ours is the LIMTopic framework [7] The distinguished fea-ture of our work is that we treat topical ranking vectors as topic distributions ofdocuments while LIMTopic treats them as document distributions of topics Ourmethod is arguably more flexible and can construct both rank-integrated PLSAand LDA model under a unified framework while LIMTopic can only work withsymmetric PLSA model
Our work is also closely related to ranking technology PageRank and HITS(Hyperlink-Induced Topic Search) are two most popular link based ranking algo-rithms Topical link analysis [14] extends basic PageRank and HITS by comput-ing a score vector for each page to distinguish the contribution from differenttopics Yao et al [15] extend pair-wise ranking models with probabilistic topicmodels and propose a collaborative topic ranking model to alleviate data spar-sity problem in recommender system Ding et al [16] take a topic modelingapproach for preferences ranking by assuming that the preferences of each userare generated from a probabilistic mixture of a few latent global rankings thatare shared across the user population Both of Yao’s and Ding’s models focus onemploying topic modeling to solve ranking problem, while our work incorporateslink structures into topic modeling through ranking
D All the documents in the corpus
D, V, K The number of documents, unique words, topics in the corpus
d Document index in the corpus
N d The number of words in document d
μ The probability of generating specific documents
θ d The topic distribution of document d, expressed by a multinomial
distribution of topics
γ d The topical ranking vector of document d
w dn The nth word in document d, w dn ∈ {1, 2, V }
z dn The topic assignment of word w dn , z dn ∈ {1, 2, K}
β k The multinomial distribution over words specific to topic k
α, η Dirichlet priors to multinomial distribution θ , β