Hancock(ID:4933/han001)


Language for Extracting Signatures from Data Streams



References:
  • Bonachea, D., K. Fisher, A. Rogers, and F. Smith. Hancock: A language for processing very large-scale data. In DSL'99, October 1999, pp. 163-176. Abstract: A signature is an evolving customer profile
    computed from call records. AT&T uses
    signatures to detect fraud and to target marketing.
    Code to compute signatures can be
    difficult to write and maintain because of the
    volume of data. We have designed and implemented
    Hancock, a C-based domain-specific
    programming language for describing signatures.
    Hancock provides data abstraction
    mechanisms to manage the volume of data
    and control abstractions to facilitate looping
    over records. This paper describes the
    design and implementation of Hancock, discusses
    early experiences with the language,
    and describes our design process.
  • Cortes, C.; Fisher, K.; Pregibon, D. and Rogers, A. Hancock: A Language for Extracting Signatures from Data Streams. Abstract: Massive transaction streams present a number of opportunities
    for data mining techniques. Transactions might represent
    calls on a telephone network, commercial credit card
    purchases, stock market trades, or HTTP requests to a web
    server. While historically such data have been collected for
    billing or security purposes, they are now being used to discover
    how customers or their intermediaries (called transactors)
    use the underlying services.

    For several years, we have computed evolving profiles (called
    signatures) of the transactors in large data streams using
    handwritten C code. The signature for each transactor captures
    the salient features of his transactions through time.
    Programs for processing signatures must be highly optimized
    because of the size of the data stream (several gigabytes
    per day) and the number of signatures to maintain
    (hundreds of millions). C programs to compute signatures
    often sacrificed readability for performance. Consequently,
    they are difficult to verify and maintain.

    Hancock is a domain-specific language created to express
    computationally efficient signature programs cleanly. In this
    paper, we describe the obstacles to computing signatures
    from massive streams and explain how Hancock addresses
    these problems. For expository purposes, we present Hancock
    using a running example from the telecommunications
    industry; however, the language itself is general and applies
    equally well to other data sources.

          in Proceedings of the Association for Computing Machinery Sixth International Conference on Knowledge Discovery and Data Mining, 2000
  • Kathleen Fisher "Hancock: A language for analyzing transactional data streams" Talk at California Thursday, November 21, 2002. Abstract: Massive transaction streams present a number of opportunities for data mining techniques. The transactions in such streams might represent calls on a telephone network, commercial credit card purchases, stock market trades, or HTTP requests to a web server. While historically such data have been collected for billing or security purposes, they are now being used to discover how "customers" use the associated services, where the notion of a customer might be a telephone number, a credit card number, a trade account number, or an IP address.

    For several years, we have computed evolving profiles (called signatures) of the customers mentioned in large data streams. The signature for each customer captures the salient features of his transactions through time. Programs for processing signatures must be highly optimized because of the size of the data stream (several gigabytes per day) and the number of signatures to maintain (hundreds of millions). Originally, we wrote such programs directly in C, but because these programs often sacrificed readability for performance, they were difficult to verify and maintain.

    Hancock is a domain-specific language we created to express computationally efficient signature programs cleanly. In this talk, I will describe the obstacles to computing signatures from massive streams and explain how Hancock addresses these problems. For expository purposes, I present Hancock using a running example from the telecommunications industry; however, the language itself is general and applies equally well to other data sources.
  • Cortes, Corinna et al "Hancock: A language for analyzing transactional data streams" (TOPLAS) 26(2) March 2004 pp. 301-338. Abstract: Massive transaction streams present a number of opportunities for data mining techniques. The transactions in such streams might represent calls on a telephone network, commercial credit card purchases, stock market trades, or HTTP requests to a web server. While historically such data have been collected for billing or security purposes, they are now being used to discover how the transactors, for example, credit-card numbers or IP addresses, use the associated services. Over the past 5 years, we have computed evolving profiles (called signatures) of transactors in several very large data streams. The signature for each transactor captures the salient features of his or her behavior through time. Programs for processing signatures must be highly optimized because of the size of the data stream (several gigabytes per day) and the number of signatures to maintain (hundreds of millions). Originally, we wrote such programs directly in C, but because these programs often sacrificed readability for performance, they were difficult to verify and maintain. Hancock is a domain-specific language we created to express computationally efficient signature programs cleanly. In this paper, we describe the obstacles to computing signatures from massive streams and explain how Hancock addresses these problems. For expository purposes, we present Hancock using a running example from the telecommunications industry; however, the language itself is general and applies equally well to other data sources.