YAN Hongfei:
Associate Professor
Phone: +86-10-62765815-8005
Email: yanhf@pku.edu.cn
Research Interests:
Search Engine
Internet Information Mining
Courses Taught:
CS101 Introduction to Computing (Fall 2014, Fall 2013, Fall 2012, Fall 2011, Fall 2010, Fall 2008, Fall 2007)
CS202 Data Structure and Algorithm Analysis (Spring 2014)
CS402 Mass Data Processing/Cloud Computing (Summer 2013, Summer 2012, Summer 2011, Summer 2010, Summer 2009, Summer 2008)
CS410 Information Retrieval (Spring 2014, Spring 2013)
CS501 Distributed Systems (Spring 2012, Spring 2011, Spring 2010, Spring 2009, Spring 2008, Fall 2005, Fall 2004, Fall 2003)
CCF-ADL Information Retrieval (Summer 2010)
TA for CS410 Information Retrieval: A Special Course in CS, Peking Univ. (Summer 2008), Sponsored by Dragon Star Committee and instructed by Professor Chengxiang Zhai from UIUC.
CS201 Machine Organization and Assembly Programming (Spring 2004, Spring 2003)
TA for CS512 Data Mining: A Special Course in CS, Peking Univ. (Summer 2002), Sponsored by Dragon Star Committee and instructed by Professor Jiawei Han from UIUC.
Profile:
Hongfei Yan is currently an associate professor at the Networks and Distributed Systems Laboratory at Peking University. He received a B.Sc. degree and an M.Sc. degree in CS from Harbin Engineering University in 1996 and 1999, respectively, and a Ph.D. degree in CS from Peking University in 2002. His research interests center on Information Retrieval and Distributed Systems.
Dr. Yan has published more than 60 research papers, most of them in top-tier conferences such as SIGIR, WSDM, KDD, EMNLP and ACL. He was awarded the second prize of the Beijing Science and Technology Progress Award (2004) and the second prize of the China Computer Federation Science and Technology Award (2016).
Dr. Yan has more than five research projects, including projects under NSFC, the Core-High-Basic program (the National Science and Technology Major Project on "core electronic devices, high-end general chips and basic software products"), the 863 Program, etc.
His research achievements are summarized as follows:
1) Scalable event detection: Mining retrospective events from text streams has been an important research topic. The classic text representation model (i.e., the vector space model) cannot model the temporal aspects of documents. To address this, he proposed a novel burst-based text representation model, denoted BurstVSM. BurstVSM maps dimensions to bursty features instead of terms, which lets it capture both semantic and temporal information. It also significantly reduces the number of non-zero entries in the representation. He tested it on scalable event detection, and experiments on a 10-year news archive show that the method is both effective and efficient.
2) Event discovery and retrieval on multi-type historical data: He presents EventSearch, a system for event extraction and retrieval on four types of news-related historical data, i.e., Web news articles, newspapers, TV news programs, and micro-blog short messages. The system incorporates over 11 million web pages extracted from "Web InfoMall", the Chinese Web Archive since 2001. The newspapers and TV news video clips also span 2001 to 2011. Upon a user query, the system returns a list of event snippets from multiple data sources. A novel burst model is used to discover events from time-stamped texts. In addition to offline event extraction, the system also provides online event extraction to further meet user needs. EventSearch provides meaningful analytics that synthesize an accurate description of events. Users interact with the system by ranking the identified events using different criteria (scale, recency and relevance) and by submitting their own information needs in different input fields.
3) Architectural design and evaluation of an efficient Web-crawling system: He presents the architectural design and evaluation results of an efficient Web-crawling system. The design involves a fully distributed architecture, a URL allocating algorithm, and a method to assure system scalability and dynamic reconfigurability. Simulation experiments show that load balance, scalability and efficiency can be achieved in the system. This distributed Web-crawling subsystem has been successfully integrated with WebGather, a well-known Chinese and English Web search engine, which aims to collect all the Web pages in China and keep pace with the rapid growth of Chinese Web information. In addition, he believes that the design can also be useful in other contexts, such as digital libraries.
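As an illustration of what a URL allocating step in a fully distributed crawler can look like, the following minimal Python sketch assigns each URL to a crawler node by hashing its host name. This is a common textbook scheme and only an assumption here; the profile does not describe the actual algorithm used in the system. The function names assign_node and partition are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def assign_node(url: str, num_nodes: int) -> int:
    """Return the index of the crawler node responsible for this URL.

    Assumption: allocation is by host name, so all pages of one site
    go to the same node. This is an illustrative scheme, not the
    algorithm of the system described above.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    # urlsplit lowercases the host; fall back to "" for malformed URLs.
    host = urlsplit(url).hostname or ""
    # Use a stable hash (Python's built-in hash() is salted per process).
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(urls, num_nodes: int):
    """Group URLs into one bucket per crawler node."""
    buckets = {i: [] for i in range(num_nodes)}
    for url in urls:
        buckets[assign_node(url, num_nodes)].append(url)
    return buckets


if __name__ == "__main__":
    seeds = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/index.html",
    ]
    for node, assigned in partition(seeds, 4).items():
        print(node, assigned)
```

Hashing by host keeps every page of a site on one node, which makes per-site politeness limits easy to enforce. Plain modulo hashing reassigns most hosts when the number of nodes changes, so a design that aims at dynamic reconfigurability would need something more stable, for example consistent hashing.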