Hancock (ID:4933/han001)

Language for Extracting Signatures from Data Streams

References:

computed from call records. AT&T uses signatures to detect fraud and to target marketing. Code to compute signatures can be difficult to write and maintain because of the volume of data. We have designed and implemented Hancock, a C-based domain-specific programming language for describing signatures. Hancock provides data abstraction mechanisms to manage the volume of data and control abstractions to facilitate looping over records. This paper describes the design and implementation of Hancock, discusses early experiences with the language, and describes our design process.

for data mining techniques. Transactions might represent calls on a telephone network, commercial credit card purchases, stock market trades, or HTTP requests to a web server. While historically such data have been collected for billing or security purposes, they are now being used to discover how customers or their intermediaries (called transactors) use the underlying services. For several years, we have computed evolving profiles (called signatures) of the transactors in large data streams using handwritten C code. The signature for each transactor captures the salient features of his transactions through time. Programs for processing signatures must be highly optimized because of the size of the data stream (several gigabytes per day) and the number of signatures to maintain (hundreds of millions). C programs to compute signatures often sacrificed readability for performance. Consequently, they are difficult to verify and maintain. Hancock is a domain-specific language created to express computationally efficient signature programs cleanly. In this paper, we describe the obstacles to computing signatures from massive streams and explain how Hancock addresses these problems.
For expository purposes, we present Hancock using a running example from the telecommunications industry; however, the language itself is general and applies equally well to other data sources.
in Proceedings of the Association for Computing Machinery Sixth International Conference on Knowledge Discovery and Data Mining, 2000

For several years, we have computed evolving profiles (called signatures) of the customers mentioned in large data streams. The signature for each customer captures the salient features of his transactions through time. Programs for processing signatures must be highly optimized because of the size of the data stream (several gigabytes per day) and the number of signatures to maintain (hundreds of millions). Originally, we wrote such programs directly in C, but because these programs often sacrificed readability for performance, they were difficult to verify and maintain. Hancock is a domain-specific language we created to express computationally efficient signature programs cleanly. In this talk, I will describe the obstacles to computing signatures from massive streams and explain how Hancock addresses these problems. For expository purposes, I present Hancock using a running example from the telecommunications industry; however, the language itself is general and applies equally well to other data sources.
in Proceedings of the Association for Computing Machinery Sixth International Conference on Knowledge Discovery and Data Mining, 2000