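Dr. Wei Fan, an internationally renowned data mining scholar and a research staff member at the IBM Watson Research Center in the United States, will visit our university and give an academic talk at 10:00 a.m. on April 6 in Room 404, Teaching Building 7. The talk title, abstract, and speaker bio follow. All who are interested are warmly encouraged to attend.
Thank you for your attention!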
From Feature Construction, to Simple but Effective Inductive Modeling, towards Domain Transfer
Wei Fan, http://www.cs.columbia.edu/~wfan, or http://www.weifan.info
This talk covers a "sequence" of solutions to some of the most important problems in data mining. In real-world applications, data is rarely in feature vector format; it is normally semi-structured or unstructured. Examples include transaction sequences, social networks, network connection events, biological sequences, still images, and video. The main problem is that, in order to use most of today's inductive learning methods, one first has to derive predictive feature vectors from this raw data. We discuss a method called Model-based Tree (MbT) that uses frequent patterns in the raw data to search for highly predictive patterns from which good features can be constructed. However, this is an NP-hard problem. The proposed method uses a divide-and-conquer principle to avoid exhaustive search. It scales linearly and can discover features that train models with accuracy higher than benchmark results on some of the most difficult problems. Many of the features constructed by MbT cannot be found by any existing approach without running into a prohibitive combinatorial explosion.
Once the data is in feature vector format, the next important question is "which inductive algorithm to use?" Because there are many inductive learning algorithms to choose from, algorithm selection is a non-trivial process. We discuss a method called Random Decision Tree (RDT) that is remarkably simple to use and works for all three major inductive learning problems (classification, regression, and probability estimation). The main advantages of RDT are simplicity, accuracy, efficiency, natural support for streaming data, and robustness against sample selection bias. One of its applications, in weather forecasting, won the ICDM'06 best application paper award, and our submission using RDT to the ICDM'08 Data Mining Contest won the championship.
The third important scenario is that training and testing data may not always come from the same distribution, as one would desire. In the last part of the talk, we will discuss a few effective and novel approaches for transferring knowledge from a related but different domain into the target domain (examples include using Reuters data to make predictions on New York Times articles). The source code and data sets for some of the solutions are available from the speaker's homepage, http://www.cs.columbia.edu/~wfan
Bio Sketch of Wei Fan:
Dr. Wei Fan received his PhD in Computer Science from