This is a short review of programs and packages available for public access, by anonymous ftp or web. It takes the form of cuts-and-pastes from newsgroup postings or email messages. No attempt has been made to list codes which can be had by directly contacting the author. No attempt has been made to list system-specific sites (e.g. SAS, XLisp-Stat). No attempt has been made to list commercial or shareware codes. No guarantees are given nor implied in respect to software referred to here.
F. Murtagh (fmurtagh @ astro.u-strasbg.fr, f.murtagh @ qub.ac.uk), May 1994. Updates Sept. 1994, July 1995, October 1996, February 1997, March 1997, November 1997, March 1998, April 1998, May 1998, January 1999, July 1999, March 2000, June 2000, April 2001, May 2001, June 2001, May, August 2002, May 2004.
Gopher to lib.stat.cmu.edu
Anonymous ftp to lib.stat.cmu.edu
URL: http://lib.stat.cmu.edu/
Here are some areas to check out:
CMLIB - Core Mathematics Library from NIST. CLUSTER "is a sublibrary of Fortran subroutines for cluster analysis and related line printer graphics. It includes routines for clustering variables and/or observations using algorithms such as direct joining and splitting, Fisher's exact optimization, single-link, K-means, and minimum mutations, and routines for estimating missing values. The subroutines in CLUSTER are described in the book "Clustering Algorithms" by J. A. Hartigan."
APSTAT - Selected Algorithms Transcribed from Applied Statistics. Mostly Fortran. Includes implementations of: minimal spanning tree, single-link hierarchical clustering, discriminant analysis of categorical data, branch and bound algorithm for feature subset selection, etc.
GENERAL - Software of General Statistical Interest. Includes the 3-d interactive data display package, XGobi. Algorithms for convex hull, and Delaunay triangulation. Mclust, model-based clustering routines (Banfield and Raftery). MVE, minimum volume ellipsoid estimator (Rousseeuw), PROGRESS, robust regression (Rousseeuw and Leroy), MARS, projection pursuit. Nonlinear discriminant analysis. LOESS regression. Etc.
MULTI - Multivariate Analysis and Clustering. Hierarchical clustering, principal components analysis, discriminant analysis. Former are mainly Fortran. Macintosh programs for multivariate data analysis and graphical display, linear regression with errors in both variables, software directory including details of packages for phylogeny estimation and to support consensus clustering.
And of course... MULTIV - Clustering, PCA, Correspondence Analysis, from F. Murtagh.
From Tim Hesterberg, May 17 2002: I just downloaded your 1994 version of multiv from statlib. For the most part it runs under Splus6 with no modifications.
In particular, it appears that the .q files do not need to be modified to work with newer versions of S+ based on SV4; that was a pleasant surprise. I did run into the following problems:
(1) On Linux and Windows sammon.f fails to compile. The fix is to switch these two lines:
      dimension x(n,m), y(n,p), dstar(ndis), d(ndis)
      integer n,m,p,i,j,k,iter,maxit,diag
so that p is declared integer before being used.
(2) The examples in help(bea) fail to run, because they require an object `a' which does not exist. I got them to run by first doing
      a <- author.count
(3) In help(ca), change:
      text(corr$rproj[,1], corr$rproj[,2], labels=dimnames(bfposneg[]))
to:
      text(corr$rproj[,1], corr$rproj[,2], labels=dimnames(bfposneg)[])
(4) The examples at the bottom of help(ca) fail to run, because they depend on the existence of objects `a', `b', `c' which are not defined in the library or earlier in the help file.
Anonymous ftp to netlib.att.com The programs from the "First and Second Multidimensional Scaling Packages of Bell Laboratories" are available in the subdirectory netlib/mds.
It (the tooldiag package, version 1.5, judging by the archive name) can be obtained via anonymous ftp in binary mode from ftp.fct.unl.pt, directory pub/di/packages, as tooldiag1.5.tar.Z.
Demonstration software in C-source form is available to researchers for non-commercial purposes only. (Contact author.)
Summary of responses to message in Vision-List Digest (20 April 1994) - see below for compiler, and subscription details to this Digest:
Algorithm by Steve Fortune is available from email@example.com Use: "send sweep2 from voronoi" The algorithm calculates both Voronoi and Delaunay diagrams.
Quickhull by anonymous ftp from geom.umn.edu get /pub/software/qhull.tar.Z The algorithm calculates the Delaunay triangulation and convex hull.
nnsort.c Dave Watson sent me a copy of nnsort.c which computes the Delaunay triangulation and convex hull in 2D and 3D.
deltree.c Olivier Devillers sent a copy of deltree.c which computes the Voronoi/Delaunay diagrams and also has a function that returns the nearest neighbour point in the diagram to any arbitrarily chosen point. He also includes an interactive interface in SunView. (Comments in French)
Books: "Computational Geometry in C", by Joseph O'Rourke, Cambridge University Press, 1994, ISBN 0-521-44592-2. This has complete programs for Voronoi/Delaunay diagrams.
[Msg. from firstname.lastname@example.org, in moderated Vision-List Digest; membership requests to email@example.com]
3-d voronoi diagrams: vcs (John M. Sullivan, Geometry Center, Univ. Minn.; firstname.lastname@example.org): "code for 3-d voronoi diagrams". Available by anonymous ftp from: geom.umn.edu:pub/vcs.tar.Z
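As an illustration of what the convex hull part of the programs listed above computes, here is a minimal 2-D sketch in Python. It uses Andrew's monotone chain method, which is an assumption chosen for brevity; it is not the Quickhull algorithm, nor the method of nnsort.c or any other program named above, and the function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices of 2-D points in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop the last vertex while it would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]
```

For the unit square with one interior point, convex_hull([(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)]) returns the four corners and drops the interior point.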
Newsgroups: sci.stat.math,sci.stat.edu,sci.stat.consult From: email@example.com (Warren Sarle) Date: Sun, 11 Sep 1994 18:35:20 GMT In "CART- Classification and Regression Trees", sci.stat.math article <firstname.lastname@example.org>, email@example.com (AJHorovitz) writes: |> |> CART-Classification and Regression Trees (Algorithms produced by |> California Statistical Software (Breiman, et al, 1984) and Interface by |> SALFORD SYSTEMS) |> ... |> CART is a new tree structured statistical analysis program that can |> automatically search for and find the hidden structure in your data. Based |> on the original work of some of the world's leading statisticians, CART is |> the only "stand-alone" tree-based program that can give you statistically |> valid results. Since the task of distributing information on empirical decision tree methodology seems to have fallen on me, I feel I should correct the misinformation in the post quoted above. I asked AJHorovitz whether he intended to say that FIRM, Knowledge Seeker, and Data Splits (to name but a few '"stand-alone" tree-based programs') are statistically invalid. The gist of his reply was that only cross-validation yields statistically valid results. FIRM and Knowledge Seeker do multiplicity-adjusted significance tests. While some statisticians have philosophical objections to significance tests, branding significance tests as invalid in advertising literature strikes me as misleading as anything in the Systat/Statistica debate. Even if we grant that significance tests are statistically invalid, we are left with the fact that Data Splits and IND both do the same kind of cross-validation as CART does. So the claim that 'CART is the only "stand-alone" tree-based program that can give you statistically valid results' is clearly incorrect. I set follow-ups to sci.stat.edu, since that is where the recent debate on statistical software marketing has been going on. Here is my summary of empirical decision tree software. 
I updated the information on CART to give Salford Systems' address.
..................................................................
There are many algorithms and programs for computing empirical decision trees. Several families can be identified with typical characteristics as listed below:
The CART family: CART, tree (S), etc.
Motivation: statistical prediction. Exactly two branches from each nonterminal node. Cross-validation and pruning are used to determine size of tree. Response variable can be quantitative or nominal. Predictor variables can be nominal or ordinal, and continuous predictors are supported.
The CLS family: CLS, ID3, C4.5, etc.
Motivation: concept learning. Number of branches equals number of categories of predictor. Only nominal response and predictor variables are supported in early versions, although I'm told that the latest version of C4.5 supports ordinal predictors.
The AID family: AID, THAID, CHAID, MAID, XAID, FIRM, TREEDISC, etc.
Motivation: detecting complex statistical relationships. Number of branches varies from two to the number of categories of predictor. Statistical significance tests (with multiplicity adjustments in the later versions) are used to determine size of tree. AID, MAID, and XAID are for quantitative responses. THAID, CHAID, and TREEDISC are for nominal responses, although the version of CHAID from Statistical Innovations, distributed by SPSS, can handle a quantitative categorical response. FIRM comes in two varieties for categorical or continuous response. Predictors can be nominal or ordinal and there is usually provision for a missing-value category. Some versions can handle continuous predictors, others cannot.
There are also a variety of methods that do splits on linear combinations rather than single predictors. I have not yet constructed a taxonomy for such methods.
Some programs combine two or more families. For example, IND combines methods from CART and C4 as well as Bayesian and minimum encoding methods.
Knowledge Seeker combines methods from CHAID and ID3 with a novel multiplicity adjustment.
There are numerous unresolved statistical issues regarding these methods. Perhaps the most important is how big should the tree be? CART supporters claim that its pruning method using cross-validation is superior to the significance testing method used in the AID family. However, pruning is very easy and quick to do in the AID family since the p-values are computed while growing the tree and no cross-validation is required for pruning. The validity of CART cross-validation is suspect because CART seems to produce much smaller trees than the AID family, even using very conservative significance levels for the latter, which one would expect to validate well although empirical evidence is scarce. I have not seen any published comparison of CART and AID methods. This would make an excellent topic for a thesis.
Some references:
Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984), _Classification and Regression Trees_, Wadsworth: Belmont, CA.
Chambers, J.M. and Hastie, T.J. (1992), _Statistical Models in S_, Wadsworth & Brooks/Cole: Pacific Grove, CA.
Hawkins, D.M. & Kass, G.V. (1982), "Automatic Interaction Detection", in Hawkins, D.M., ed., _Topics in Applied Multivariate Analysis_, 267-302, Cambridge Univ Press: Cambridge.
Morgan & Messenger (1973) _THAID--a sequential analysis program for the analysis of nominal scale dependent variables_, Survey Research Center, U of Michigan.
Morgan & Sonquist (1963) "Problems in the analysis of survey data and a proposal", JASA, 58, 415-434. (Original AID)
Morton, S.C. (1992) "New advances in statistical dendrology", Chance, 5, 76-79. See also letter to editor in volume 6 no. 1.
Quinlan, J.R. (1993), _C4.5: Programs for Machine Learning_, Morgan Kaufmann: San Mateo, CA.
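To make the "exactly two branches from each nonterminal node" property of the CART family concrete, here is a minimal Python sketch of choosing one binary split on a single continuous predictor. The Gini impurity criterion and the function names are assumptions made for illustration; this is not the CART program's code, and it leaves out the pruning and cross-validation steps discussed above.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'.

    The threshold is None when no split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot separate two identical predictor values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((lo + hi) / 2, score)
    return best
```

Applied recursively to each resulting half, a routine of this kind grows a binary tree; how large that tree should be is exactly the open question raised in the post.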
The following information on software sources has been culled from previous posts and may be out of date or inaccurate:
C4.5: C source code for a new, improved decision tree algorithm known as C4.5 is in the new book by Ross Quinlan (of ID3 fame). "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1992. It goes for $44.95. With accompanying software on magnetic media it runs for $69.95. ISBN # 1-55860-238-0
CART: Salford Systems, 341 N44th Street #711, Lincoln NE 68503, USA. Academic price is $399.00 (US). SYSTAT Corporation distributes a PC version of CART. They can be reached at SYSTAT, Inc., 1800 Sherman Avenue, Evanston, IL 60201, USA. Phone: (708) 864-5670, FAX: (708) 492-3567.
CHAID: PC version from SPSS (800) 543-5831. Mainframe version from Statistical Innovations Inc., 375 Concord Avenue Belmont, Mass. 02178
Data Splits: From Padraic Neville (510) 787-3452, $10 for preliminary release.
FIRM: `FIRM Formal Inference-based Recursive Modeling', University of Minnesota School of Statistics Technical Report #546, 1992. The writeup and a diskette containing executables is available from the U of M bookstore for $17.50. Incredible bargain!
IND: Version 2.0 should be available soon at a modest price from NASA's COSMIC center in Georgia, USA. Enquiries should be directed to: mail (to customer support): firstname.lastname@example.org Phone: (706) 542-3265 and ask for customer support. FAX: (706) 542-4807.
Knowledge Seeker: Phone 613 723 8020.
PC-Group: is available from Austin Data Management Associates, P.O. Box 4358, Austin, TX 78765, (512) 320-0935. It runs on IBM and compatible personal computers with 512K of memory, and costs $495. A free demo version of the program is available upon request. New address, 20 July 1998 - new company name: Stepwise Systems, Inc. P.O. Box 4358 Austin, Texas 78765 Phone: 512-327-8861 Email: email@example.com Web: www.stepsys.com
tree: S: phone 800 462-8146
TREEDISC: SAS macro using SAS/IML and SAS/OR available free from SAS Institute technical support (919) 677-8000.
--
Warren S. Sarle, SAS Institute Inc., SAS Campus Drive, Cary, NC 27513, USA
firstname.lastname@example.org (919) 677-8000
The opinions expressed here are mine and not necessarily those of SAS Institute.
From: Ronny Kohavi
Date: Tue, 24 Jan 1995 MLC++, a Machine Learning library in C++. MLC++ is a library of C++ classes and tools for supervised Machine Learning being developed at the Robotics lab in Stanford University. Ronny Kohavi (ronnyk@CS.Stanford.EDU, http://robotics.stanford.edu/~ronnyk)
From: email@example.com (J.J. Merelo Guervos)
Date: 30 Dec 1994 11:27:34 GMT
Subject: Announcing S-LVQ 1.0.1
Dear fellow netters: After getting some bug reports from users, I have fixed S-LVQ and produced a new version, which is basically a bug fix from 1.0. Here is the blurb.
S-LVQ is a quite simple program to perform Kohonen's LVQ algorithm. I know there is a very good program already made by Kohonen's team (LVQ_PAK), but, anyways, I had done it for my own purposes and thought it would be a good idea to release it into the public domain; it could be useful to somebody.
Some features:
-Command line interface to set the training file, test or validation file, number of neurons and number of epochs.
-Easy file setup
-Graphics interface written in TCL/TK, which allows setting the parameters and visualizes the results, as points if the training/weight vectors are 2-dimensional, and as lines if they are not.
Changes from version 1.0:
-Autoconfiguration
-Bug fixes for Sun SPARCstations.
If you want to know more about Kohonen's LVQ, this is the main reference: Kohonen, T.; "The Self-Organizing Map", Procs. IEEE, vol. 78, pp. 1464-1480, 1990.
It's available from the usual sources, that is
1. FTP: get it at ftp://kal-el.ugr.es/pub/s-lvq-1.0.1.tar.gz
2. ftpmail: use your favorite ftpmail server, or send a message to firstname.lastname@example.org with the body open get s-lvq close You'll receive a uu-encoded version of the former program
3. WWW: connect to GeNeura's home page at http://kal-el.ugr.es/geneura.html, and follow instructions.
-- Dr. JJ Merelo Grupo Geneura ---- Univ. Granada
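For readers unfamiliar with the algorithm that S-LVQ and LVQ_PAK implement, here is a minimal Python sketch of one training step of the basic LVQ1 rule: the codebook vector nearest to a training sample moves toward the sample when their classes agree and away from it otherwise. The function name, the data layout, and the fixed learning rate alpha are assumptions for illustration; this is not code from either package.

```python
def lvq1_step(codebook, x, label, alpha=0.1):
    """Apply one LVQ1 update in place and return the index of the winning vector.

    codebook is a list of (vector, class_label) pairs; x is the training sample.
    """
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # The winner is the codebook vector closest to the sample.
    winner = min(range(len(codebook)), key=lambda i: sqdist(codebook[i][0], x))
    vec, cls = codebook[winner]
    # Attract when the classes match, repel when they differ.
    sign = 1.0 if cls == label else -1.0
    codebook[winner] = ([v + sign * alpha * (xi - v) for v, xi in zip(vec, x)], cls)
    return winner
```

In practice the step is repeated over the training file for a number of epochs, typically with a learning rate that decreases over time; those two quantities correspond to the epoch and neuron settings mentioned in the feature list.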
Some time ago we released the software package "LVQ_PAK" for the easy application of Learning Vector Quantization algorithms. Corresponding public-domain programs for the Self-Organizing Map (SOM) algorithms are now available via anonymous FTP on the Internet. "What does the Self-Organizing Map mean?", you may ask --- See the following reference, then: Teuvo Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464-1480, 1990. In short, Self-Organizing Map (SOM) defines a 'non-linear projection' of the probability density function of the high-dimensional input data onto the two-dimensional display. SOM places a number of reference vectors into an input data space to approximate to its data set in an ordered fashion. This package contains all the programs necessary for the application of Self-Organizing Map algorithms in an arbitrary complex data visualization task. This code is distributed without charge on an "as is" basis. There is no warranty of any kind by the authors or by Helsinki University of Technology. In the implementation of the SOM programs we have tried to use as simple code as possible. Therefore the programs are supposed to compile in various machines without any specific modifications made on the code. All programs have been written in ANSI C. The programs are available in two archive formats, one for the UNIX-environment, the other for MS-DOS. Both archives contain exactly the same files. These files can be accessed via FTP as follows: 1. Create an FTP connection from wherever you are to machine "cochlea.hut.fi". The internet address of this machine is 126.96.36.199, for those who need it. 2. Log in as user "anonymous" with your own e-mail address as password. 3. Change remote directory to "/pub/som_pak". 4. At this point FTP should be able to get a listing of files in this directory with DIR and fetch the ones you want with GET. (The exact FTP commands you use depend on your local FTP program.) 
Remember to use the binary transfer mode for compressed files.
The som_pak program package includes the following files:
- Documentation:
    README                short description of the package and installation instructions
    som_doc.ps            documentation in (c) PostScript format
    som_doc.ps.Z          same as above but compressed
    som_doc.txt           documentation in ASCII format
- Source file archives (which contain the documentation, too):
    som_p1r0.exe          Self-extracting MS-DOS archive file
    som_pak-1.0.tar       UNIX tape archive file
    som_pak-1.0.tar.Z     same as above but compressed
An example of FTP access is given below
    unix> ftp cochlea.hut.fi (or 188.8.131.52)
    Name: anonymous
    Password:
    ftp> cd /pub/som_pak
    ftp> binary
    ftp> get som_pak-1.0.tar.Z
    ftp> quit
    unix> uncompress som_pak-1.0.tar.Z
    unix> tar xvfo som_pak-1.0.tar
See file README for further installation instructions.
All comments concerning this package should be addressed to email@example.com.
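The idea behind the Self-Organizing Map described in this announcement (reference vectors placed in the input space and kept in an ordered arrangement) can be sketched in a few lines of Python. The sketch assumes a one-dimensional map and a Gaussian neighbourhood function, both chosen for brevity; SOM_PAK works with two-dimensional maps and its own options, and the names below are hypothetical.

```python
import math


def som_step(weights, x, alpha=0.5, radius=1.0):
    """Apply one SOM update in place and return the best-matching unit's index.

    weights is a list of reference vectors; a vector's list index is its
    position on a 1-D map.
    """
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    bmu = min(range(len(weights)), key=lambda i: sqdist(weights[i], x))
    for i, w in enumerate(weights):
        # Units near the winner on the map are pulled strongly toward x,
        # distant units only weakly; this is what keeps the map ordered.
        h = math.exp(-((i - bmu) ** 2) / (2.0 * radius ** 2))
        weights[i] = [wi + alpha * h * (xi - wi) for wi, xi in zip(w, x)]
    return bmu
```

Training repeats this step over the data while alpha and radius shrink, so the map first unfolds globally and then settles locally.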
Date: Mon, 20 Feb 1995 08:01:37 +0000 From: "Warren L. Kovach"
Subject: WWW: Statistical and data analysis software I am pleased to announce my new World Wide Web pages focusing on shareware and public domain statistical and data analysis software. The URL is: http://www.compulink.co.uk/kovcomp These pages provide detailed information about and shareware copies of my programs MVSP and Oriana. MVSP is a multivariate statistical program for MS-DOS that calculates a variety of cluster analyses as well as PCA, PCO, and correspondence/detrended correspondence analysis. Oriana is my new circular statistics/orientation analysis package for Windows. The pages also have a list of resources on the Internet related to statistical software. In particular, there are many links to WWW pages and FTP sites that have software. I hope to maintain a definitive list of sources of shareware and public domain software on the Internet. If you know of sites that are not yet on my list I would appreciate hearing about them. For a bit of fun, there is also a page with information about the Isle of Anglesey, in North Wales, the home of Kovach Computing Services, and links to other WWW pages about Wales. Come and learn how to pronounce one of the longest placenames in the world! -- Dr. Warren L. Kovach Internet: WarrenK@kovcomp.demon.co.uk Kovach Computing Services tel./fax: +44-(0)1248-450414 85 Nant-y-Felin, Pentraeth, Anglesey CompuServe: 100016,2265 Wales LL75 8UY U.K. WWW: http://www.compulink.co.uk/kovcomp
Message to CLASS-L list on 5 July 1995: Re fuzzy clustering, how about probabilistic clustering?: i.e. we give a number of classes and then each data "thing" is probabilistically assigned to the various classes. Wallace founded the information-theoretic Minimum Message Length (MML) principle in 1968 (see also subsequent closely related work of Rissanen called 'MDL') with a clustering program called Snob. Snob is freely licensed for academic research, see Wallace and Dowe(1994) for details and many references, and see ~ftp/pub/snob/ on bruce.cs.monash.edu.au for Fortran source code. Some references to Snob (due to me, I believe) and other clustering algorithms (collated by Ray Liere) is given below. Doug Fisher's Cobweb algorithm is not mentioned by Ray Liere, presumably because Ray thought everyone on that mailing list knew it. I mention Cobweb now, and apologise to anyone whose favoUrite algorithm has not been mentioned - and invite them to tell me or CLASS-L of it. Please feel free to e-mail me (David Dowe, firstname.lastname@example.org) for further info on Snob or on MML. Please flame no-one :-) . Regards (and further info follows). - David Dowe. > >From email@example.com Tue May 30 09:54:08 1995 >Date: Mon, 29 May 1995 20:48:55 -0300 >From: Ray Liere
>Subject: Summary: Unsupervised Conceptual Clustering >To: Multiple recipients of list INDUCTIVE > >A few days ago (24 May), I posted a request for ideas on unsupervised >conceptual clustering, especially methods that are not based on the >assumption that each data object is categorized into exactly one >of the clusters. > >As you have seen, some responses were posted directly to this list. >I have also received several email replies. > >My thanks to everyone for the very constructive assistance. I received >many good leads to explore. > >And ... following is the promised summary of email responses that I received: >===== >>From: Chunyu Kit >> i am doing machine learning of NL grammar rules. i need an >> appropriate clustering approach to classify the higher categories >> found into some clusters that are expected to have some >> kind of correspondence to those ones in linguistic theories, >> like NP, PP, etc. >===== >>From: Daniel Fu >> there's a system OLOC (Overlapping concepts) that was described >> in the Machine Learning Journal maybe a year ago. It's shares a lot >> with COBWEB. >===== >>From: firstname.lastname@example.org (Brad Whitehall) >> Look at the CLUSTER and CLUSTER/s systems of Stepp and Michalski. >> They actually went to great pains to make it so clusters did NOT overlap. >> Michalski is now at George Mason University and might even be able >> to supply you with some code. >> >> I would also look at Fuzzy Clustering. I think you might find it much >> more useful for the types of problems described in your note. >===== >>From: email@example.com (David L Dowe) >> Chris Wallace developed Minimum Message Length (MML) in 1968, developing >> the Snob program for unsupervised conceptual clustering and also applying >> it to a real world problem of seal skulls in the same, 1968 paper. >> >> The most recent Snob reference is >> C S Wallace and D L Dowe, "Intrinsic classification by MML - the Snob >> program", Proc. 
7th Australian Joint Conference on Artificial Intelligence >> (UNE, Armidale, NSW, Australia, November 1994), World Scientific, pp 37-44. >> >> and you might wish to look at ~ftp/pub/snob/ on bruce.cs.monash.edu.au . >> >> See also: >> C.S. Wallace, `Classification by Minimum-Message-Length >> Inference' S.G. Akl et al (eds.) Advances in Computing and >> Information - ICCI'90, Niagara Falls, Lecture Notes in Computer >> Science, No.468, Springer-Verlag, pp 72-81, 1990. >> >> Wallace.C.S., `An Improved Program for Classification', ACSC-9, vol 8, no >> 1, pp 357-366, February 1986. >> >> Wallace C.S. & Boulton, D.M., `An Information Measure for >> Classification' \fIComputer Journal\fP, Vol.11, No.2, 1968, >> pp 185-194. >> >> MML is described in the 1968 paper and in >> Wallace.C.S, Freeman.P.R., `Estimation and Inference by Compact Coding', >> The Journal of the Royal Statistical Society, Series B, Methodology, 49, 3, >> 1987, pp 223-265. >> >> with some outline in Wallace and Dowe(1994) and introductory material in >> C S Wallace and D L Dowe, "MML estimation of the von Mises concentration >> parameter", Technical Report #93/193, Department of Computer Science, Monash >> University, Clayton 3168, Australia. >> >> Autoclass is similar to the 1990 Snob (see Wallace, 1990, pp 78-80). >> The only changes to Snob since (Wallace and Dowe, 1994) have been to permit >> Poisson and (von Mises) circular variables. >> >> Peter Cheeseman is a former student of Prof. Wallace. >> >> Snob permits over-lapping mixtures. In fact (Wallace and Dowe, 1994; >> and earlier Wallace Snob work) it can lead to statistically biassed answers >> if you don't. >===== >>From: RORWIG@BPA.ARIZONA.EDU (Richard E. Orwig) >> We've done conceptual clustering using a Hopfield net and Kohonen net on >> textual data. The Hopfield technique was reported in Chen, Hsu, Orwig, >> Hoopes, and Nunamaker in last year's October _Communications of the ACM_. 
>> >> My dissertation (completed this past month) reports the use of a Kohonen >> self-organizing map for textual clustering. It should hit the microfilm >> service in a couple of weeks. >> >> A major difference between the two is exactly your point -- the Hopfield >> neural net creates conceptual cluster headings and uses the keywords to >> organize the text documents. Documents containing keywords in two or more >> cluster headings will map to two or more respective clusters. The Kohonen >> algorithm, on the other hand, maps the document to its "best" region on a >> two-dimensional concept map. I've had the map define a conceptual region >> on the map with no data in it because the documents which all contained the >> concept fit better in other regions. >===== >>From: firstname.lastname@example.org (Ranan Banerji) >> All my life I had a problem with clustering. Any clustering method is >> based on some idea of similarity, proximity etc., be they numerical, >> symbolic or whatever. This similarity is determined by what the researcher >> considers similar. Very often in an application area we need to think of >> two objects as similar when they demand similar action, or some other >> problem dependent criterion of similarity. Whenever I have looked, it >> has seemed to me that the similarity imposed by the problem and the >> similarity imposed by the intuition is not the same. So the problem lies >> in getting a match between the two measures. The problem of computational >> complexity (which seems to be the thing bothering you) comes way after that. >> Refining the clustering method (to somehow get around the mismatch) is >> what gives rise to the complexity. I have spent my life trying to >> develop and improve methods for getting the correct match, i.e to >> solving the so-called "representation problem". 
My own advice would
>> be, concentrate on sharpening your intuition of the problem so you can
>> prove to yourself that your measure matches the measure imposed by the
>> problem. Once you have done that, any fast-and-easy technique of
>> clustering will work.
>=====
>>From: beatriz
>> I do not agree Autoclass allows an object in only one class
>> because it assigns probabilities to any object.
>> One of the advantages of Autoclass is that it works in domains
>> with noise and overlapping classes. See: "Bayesian classification"
>> P. Cheeseman et al, 1988.
>=====
>
>Ray Liere
>email@example.com
----------------------------------------------------------------------------
More on SNOB, Feb. 1997, from: (Dr.) David Dowe, Dept of Computer Science, Monash University, Clayton, Victoria 3168, Australia firstname.lastname@example.org Fax:+61 3 9905-5146
http://www.cs.monash.edu.au/~dld/
ftp://ftp.cs.monash.edu.au/software/snob/
http://www.cs.monash.edu.au/~dld/mixture.modelling.page.html
------
Snob: Software developed by Chris Wallace and David Dowe for mixture modelling and clustering using the information-theoretic Minimum Message Length (MML) principle. Snob deals with data from Gaussian, multi-nomial (Bernoulli), Poisson and von Mises circular distributions, and deals with missing data. Snob has software for non-commercial use, detailed documentation, and a ReadMe file; a paper in PostScript and the latest paper are also available.
----------------------------------------------------------------------------
Autoclass: http://ic-www.arc.nasa.gov/ic/projects/bayes-group/group/html/autoclass-c-program.html Version 2.0, available 8 June 1995 (C code).
New address for Autoclass, 15 Feb. 1999: http://ic-www.arc.nasa.gov/ic/projects/bayes-group/group/autoclass/autoclass-c-program.html
Information on SNOB is also available at the above site.
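The probabilistic assignment discussed in this thread, where each data item receives a probability of membership in every class, can be sketched in Python for a one-dimensional Gaussian mixture. The sketch assumes the class weights, means and standard deviations are already known; estimating them (by MML in Snob, or by Bayesian methods in Autoclass) is the hard part and is not shown. The function name is hypothetical.

```python
import math


def responsibilities(x, components):
    """Return the posterior probability of each class for the observation x.

    components is a list of (weight, mean, std) triples describing a
    1-D Gaussian mixture whose parameters are assumed to be given.
    """
    densities = [
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in components
    ]
    total = sum(densities)
    return [d / total for d in densities]
```

With two equally weighted unit-variance classes centred at 0 and 4, an observation at 2 is shared equally between them, while an observation at 0 belongs almost entirely to the first class; no item is forced into exactly one cluster.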
Availability of the ADDTREE/P and EXTREE programs (message from James E. Corter, jec34@COLUMBIA.EDU, to the CLASS-L list on 28 July 1995):
Programs for fitting additive trees and extended trees to proximity data are now available commercially, and over the INTERNET in the form of PASCAL source code and DOS-executable code. The ADDTREE/P program for fitting additive trees incorporates a variant (Corter, 1982) of the basic Sattath & Tversky algorithm (Sattath & Tversky, 1977). The EXTREE program (Corter & Tversky, 1986) fits the extended tree model.
A procedure based on the Sattath-Tversky-Corter algorithm for fitting additive trees is available in the latest release (version 6.0) of SYSTAT for DOS, available from SPSS Inc., 444 N. Michigan Avenue, Chicago, IL 60611 (312) 329-3500.
Also, a standalone version (DOS-executable) of the ADDTREE/P program (Corter, 1982) written in the PASCAL language is available free of charge from the author. No support is available with this version, and there is an upper limit of 80 on the number of objects that can be modeled. The EXTREE program for fitting extended trees is also available (maximum n = 32). Those with access to a file transfer program such as FTP on the INTERNET can retrieve the DOS-executable versions as follows. First, FTP to ftp.ilt.columbia.edu and login as "anonymous", then connect ("cd") to the directory "users/corter". The program and documentation files can then be retrieved with the usual GET command (be sure to set the file transfer type to "BINARY" before GETing the executable files). Gopher users can get the files by gophering to gopher.ilt.columbia.edu and connecting to "users/corter".
Finally, PASCAL source code for the ADDTREE/P and EXTREE programs is maintained at an INTERNET site: the "netlib/mds" library at AT&T Bell Labs.
This resource may be accessed via email, by sending a message to the INTERNET address email@example.com containing only the single line send readme index from mds REFERENCES Corter, J.E. (1982). ADDTREE/P: A PASCAL program for fitting additive trees based on Sattath & Tversky's ADDTREE program. Behavior Research Methods and Instrumentation, 14, 353-354. Corter, J.E., & Tversky, A. (1986). Extended similarity trees. Psychometrika, 51, 429-451. Sattath, S., & Tversky, A. (1977). Additive similarity trees. Psychometrika, 42, 319-345. ===================================== James E. Corter Dept. of Measurement, Evaluation, and Applied Statistics Teachers College, Columbia University New York, NY 10027 INTERNET: firstname.lastname@example.org =====================================
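As background to the additive tree model fitted by these programs: a set of distances can be represented exactly by path lengths in an additive tree only if every four objects satisfy the four-point condition, that is, of the three pairwise sums d(i,j)+d(k,l), d(i,k)+d(j,l), d(i,l)+d(j,k), the two largest are equal. The Python sketch below only checks that condition on a distance matrix; it is not the ADDTREE/P fitting algorithm, and the function name and tolerance are assumptions.

```python
from itertools import combinations


def is_additive(d, tol=1e-9):
    """Return True if the symmetric distance matrix d (a list of lists)
    satisfies the four-point condition for every quadruple of objects."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        # The two largest of the three sums must coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True
```

Real proximity data rarely satisfy the condition exactly, which is why fitting programs search for the tree whose path lengths approximate the data best.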
CLASS-L - 7 Aug 1995 to 23 Aug 1995 Date: Wed, 23 Aug 1995 18:58:09 +0200 From: Jean-Luc Voz
Subject: ELENA Classification databases and technical reports available
Dear colleagues,
The partners of the Elena project are pleased to announce the availability of several databases related to classification together with two technical reports.
ELENA is an ESPRIT III Basic Research Action project (No. 6891). From July 92 to June 95 the ELENA project investigated several aspects of classification by neural networks, including links between neural networks and Bayesian statistical classification, incremental learning,... The project includes theoretical work on classification algorithms, simulations and benchmarks, especially on realistic industrial data. Hardware implementation, especially VLSI option, is the last objective.
The set of databases available is to be used for tests and benchmarks of machine-learning classification algorithms. The databases are split into two parts: ARTIFICIALly generated databases, mainly used for preliminary tests, and REAL ones, used for objective benchmarks and comparisons of methods. The choice of the databases has been guided by various parameters, such as availability of published results concerning conventional classification algorithms, size of the database, number of attributes, number of classes, overlapping between classes and non-linearities of the borders,... Results of PCA and DFA preprocessing of the REAL databases are also included, together with several measures useful for the databases characterization (statistics, fractal dimension, dispersion,...).
All these databases and their preprocessing are available together with a PostScript technical report describing the different databases in detail ('Databases.ps.Z' - 45 pages - 777781 bytes) and a report on the comparative benchmarking studies of various algorithms ('Benchmarks.ps.Z' - 113 pages - 1927571 bytes), either well known by the Statistical and Neural Network communities (MLP, RCE, LVQ, k_NN, GQC) or developed in the framework of the Elena project (IRVQ, PLS). A LaTeX bibfile containing more than 90 entries corresponding to the Elena partners' bibliography related to the project is also available ('Elena.bib') in the same directory.

All files are available by anonymous ftp from the following directory: ftp://ftp.dice.ucl.ac.be/pub/neural-nets/ELENA/databases

The databases are split into two parts: the 'ARTIFICIAL' ones, generated in order to obtain some defined characteristics, and for which the theoretical Bayes error can be computed, and the 'REAL' ones, collected in existing real-world applications.

The ARTIFICIAL databases ('Gaussian', 'Clouds' and 'Concentric') were generated according to the following requirements:
- heavy intersection of the class distributions,
- high degree of nonlinearity of the class boundaries,
- various dimensions of the vectors,
- already published results on these databases.
They are restricted to two-class problems, since we believe these yield answers to the most essential questions. The ARTIFICIAL databases are mainly used for rapid test purposes on newly developed algorithms.

The REAL databases ('Satimage', 'Texture', 'Iris' and 'Phoneme') were selected according to the following requirements:
- classical databases in the field of classification (Iris),
- already published results on these databases (Phoneme, from the ROARS ESPRIT project and 'Satimage' from the STATLOG ESPRIT project),
- various dimensions of the vectors,
- sufficient number of vectors (to avoid the ``empty space phenomenon'').
- the 'Texture' database, generated at INPG for the Elena project, is interesting for its high number of classes (11).

##############################################################################

###########
# DETAILS #
###########

The 'Benchmarks' technical report
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The 'Benchmarks.ps' Elena report is related to the benchmarking studies of various classifiers. Most of the classifiers which were used for the comparative benchmark studies are well known by the neural network and machine learning community. These are the k-Nearest Neighbour (k_NN) classifier, selected for its powerful probability density estimation properties; the Gaussian Quadratic Classifier (GQC), the most classical simple parametric statistical classification method; the Learning Vector Quantizer (LVQ), a powerful non-linear iterative learning algorithm proposed by Kohonen; the Reduced Coulomb Energy (RCE) algorithm, an incremental Region Of Influence algorithm; the Inertia Rated Vector Quantizer (IRVQ) and the Piecewise Linear Separation (PLS) classifiers, developed in the framework of the Elena project.

The main objectives of the 'Benchmarks.ps' Elena report are the following:
- to provide an overall comprehensive view of the general problem of comparative benchmarking studies and to propose a useful common test basis for existing and further classification methods,
- to obtain objective comparisons of the different chosen classifiers on the set of databases described in this report (each classifier being used with its optimal configuration for each particular database),
- to study the possible links between the data structures of the databases viewed by some parameters, and the behavior of the studied classifiers (mainly the evolution of their optimal configuration parameters).
- to study the links between the preprocessing methods and the classification algorithms from the point of view of performance and hardware constraints (especially computation times and memory requirements).

Databases format
~~~~~~~~~~~~~~~

All the databases available are in the following format (after decompression):
- All files containing the databases are stored as ASCII files for easy editing and checking.
- In a file, each of the n lines is reserved for one vectorial sample (instance), and each line consists of d floating-point numbers (the attributes) followed by the class label (which must be an integer).

Example:

1.51768 12.65 3.56 1.30 73.08 0.61 8.69 0.00 0.14 1
1.51747 12.84 3.50 1.14 73.27 0.56 8.55 0.00 0.00 0
1.51775 12.85 3.48 1.23 72.97 0.61 8.56 0.09 0.22 1
1.51753 12.57 3.47 1.38 73.39 0.60 8.55 0.00 0.06 1
1.51783 12.69 3.54 1.34 72.95 0.57 8.75 0.00 0.00 3
1.51567 13.29 3.45 1.21 72.74 0.56 8.57 0.00 0.00 1

There are NO missing values.

If you want to get a database, you MUST transfer it with ftp in binary mode. So if you aren't in this mode, simply type 'binary' at the ftp prompt.

EXAMPLE: to get the "phoneme" database:

    cd REAL
    cd phoneme
    binary
    get phoneme.txt
    get phoneme.dat.Z
    get ...
    cd ...
    ...
    quit

After your ftp session, you simply have to type 'uncompress phoneme.dat.Z' to get the uncompressed datafile.

Contents of the 'ARTIFICIAL' directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The databases of this directory contain only the 'ARTIFICIAL' classification problems. The present 'ARTIFICIAL' databases are only two-class problems, since these yield answers to the most essential questions. For each problem, the confusion matrix corresponding to the theoretical Bayes boundary is provided, together with the confusion matrix obtained by a k_NN classifier (k chosen to reach the minimum of the total Leave-One-Out error).
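The database format described above (d floating-point attributes followed by an integer class label on each line) could be read with a few lines of Python. This is only a minimal sketch: the function names `parse_elena_line` and `parse_elena_text` are hypothetical and are not part of the Elena distribution, and the two sample rows are copied from the example in the announcement.

```python
def parse_elena_line(line):
    """Split one record into (attributes, label).

    Assumes whitespace-separated fields: d floats, then an integer class label.
    """
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("expected at least one attribute and a class label")
    attributes = [float(x) for x in fields[:-1]]
    label = int(fields[-1])  # raises ValueError if the label is not an integer
    return attributes, label


def parse_elena_text(text):
    """Parse a whole (uncompressed) database given as one string."""
    samples = [parse_elena_line(l) for l in text.splitlines() if l.strip()]
    # Every sample should have the same dimension d.
    dimensions = {len(attrs) for attrs, _ in samples}
    if len(dimensions) > 1:
        raise ValueError("inconsistent number of attributes: %s" % sorted(dimensions))
    return samples


example = """1.51768 12.65 3.56 1.30 73.08 0.61 8.69 0.00 0.14 1
1.51747 12.84 3.50 1.14 73.27 0.56 8.55 0.00 0.00 0"""
samples = parse_elena_text(example)
```

Since the announcement states that there are no missing values, the sketch treats any short or non-numeric line as an error rather than trying to repair it.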
These databases were selected for preliminary tests and to study the behavior of the implemented algorithms on some particular problems:
- Overlapping classes: The classifier should have the ability to form a decision boundary that minimizes the amount of misclassification for all of the overlapping classes.
- Nonlinear separability: The classifier should be able to build decision regions that separate classes of any shape and size.

There is one subdirectory for each database. In this subdirectory, there is:
- A text file providing detailed information about the related database ('databasename.txt').
- The compressed database ('databasename.dat.Z'). The different patterns of each database are presented in a random order.
- For bidimensional databases, a postscript file representing the 2-D datasets (those files are in eps format).

For each subdirectory, the directoryname is the same as the name chosen for the concerned database. Here are the directorynames with a brief description.

- 'clouds' Bidimensional distributions: the class 0 is the sum of three different normal distributions while the class 1 is another normal, overlapping the class 0. 5000 patterns, 2500 in each class. This allows the study of the classifier behavior for heavy intersection of the class distributions and for high degree of nonlinearity of the class boundaries.

- 'gaussian' A set of seven databases corresponding to the same problem, but with dimensionality ranging from 2 to 8. This allows the study of the classifier behavior for different dimensionalities of the input vectors, for heavily overlapped distributions and for nonlinear separability. These databases were already studied by Kohonen in: Kohonen, T. and Barna, G. and Chrisley, R., "Statistical Pattern Recognition with Neural Networks: Benchmarking Studies", IEEE Int. Conf. on Neural Networks, SOS Printing, San Diego, 1988.
In this paper, the performance of three basic types of neural-like networks (Backpropagation network, Boltzmann machine and Learning Vector Quantization) is evaluated and compared to the theoretical limit.

- 'concentric' Bidimensional uniform concentric circular distributions. 2500 instances, 1579 in class 1, 921 in class 0. This database may be used to study the linear separability of the classifier when some classes are nested in others without overlapping.

Contents of the 'REAL' directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The databases of this directory contain only the real classification problem sets selected for the Elena benchmarking studies. There is one subdirectory for each database. In this subdirectory, there are:
- a text file giving detailed information about the related database (`databasename.txt'),
- the compressed original database in the Elena format (`databasename.dat.Z'); the different patterns of each database being presented in a random order.
- By way of a normalization process, each original feature will have the same importance in a subsequent classification process. A typical method is first to center each feature separately and then to reduce it to unit variance; this process has been applied on all the REAL Elena databases in order to build the ``CR'' databases contained in the ``databasename_CR.dat.Z'' files.

The Principal Components Analysis (PCA) is a very classical method in pattern recognition [Duda73]. PCA reduces the sample dimension in a linear way for the best representation in lower dimensions, keeping the maximum of inertia. The best axis for the representation is however not necessarily the best axis for the discrimination. After PCA, features are selected according to the percentage of initial inertia which is covered by the different axes, and the number of features is determined according to the percentage of initial inertia to keep for the classification process.
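The centering-and-reduction step and the inertia-based choice of the number of principal axes described above can be sketched as follows. This is an illustration under stated assumptions, not the Elena partners' code: the function names are hypothetical, the population standard deviation (divisor n) is assumed because the announcement does not say which divisor was used, and the eigenvalues are taken as given (for instance as listed in a file of eigenvalues and inertia percentages) rather than computed here.

```python
import math


def center_and_reduce(rows):
    """Center each feature and scale it to unit variance (the 'CR' step)."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # Assumption: population standard deviation (divisor n).
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(d)]
    # A constant feature has zero variance; map it to 0.0 instead of dividing by zero.
    return [[(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
             for j in range(d)] for r in rows]


def components_to_keep(eigenvalues, target_percent):
    """Smallest number of leading axes whose cumulative inertia reaches target_percent."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if 100.0 * cumulative / total >= target_percent:
            return k
    return len(ordered)


scaled = center_and_reduce([[1.0, 10.0], [3.0, 30.0]])
kept = components_to_keep([4.0, 3.0, 2.0, 1.0], 65.0)
```

In the second call the two leading axes cover 70% of the total inertia, which is the first cumulative value at or above the requested 65%.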
This selection method has been applied on every REAL database after centering and reduction (thus on the databasename_CR.dat files). When quasi-linear correlations exist between some initial features, these redundant dimensions are removed by PCA and this preprocessing is then recommended. In this case, before a PCA, the determinant of the data covariance matrix is near zero; this database is thus badly conditioned for all processes which use this information (the quadratic classifier for example).

The following files, related to PCA, are also available for the REAL databases:
- ``databasename_PCA.dat.Z'', the projection of the ``CR'' database on its principal components (sorted in a decreasing order of the related inertia percentage),
- ``databasename_corr_circle.ps.Z'', a graphical representation of the correlation between the initial attributes and the first two principal components,
- ``databasename_proj_PCA.ps.Z'', a graphical representation of the projection of the initial database on the first two principal components,
- ``databasename_EV.dat'', a file with the eigenvalues and associated inertia percentages.

The Discriminant Factorial Analysis (DFA) can be applied to a learning database where each learning sample belongs to a particular class [Duda73]. The number of discriminant features selected by DFA is fixed as a function of the number of classes (c) and of the number of input dimensions (d); this number is equal to the minimum of d and c-1. In the usual case where d is greater than c, the output dimension is fixed equal to the number of classes minus one and the discriminant axes are selected in order to maximize the between-variance and to minimize the within-variance of the classes. The discrimination power (ratio of the projected between-variance over the projected within-variance) is not the same for each discriminant axis: this ratio decreases for each axis.
So for a problem with many classes, this preprocessing will not always be efficient, as the last output features will not be so discriminant. This analysis uses the information of the inverse of the global covariance matrix, so the covariance matrix must be well conditioned (for example, a preliminary PCA must be applied to remove the linearly correlated dimensions). The DFA preprocessing method has been applied on the first 18 principal components of the 'satimage_PCA' and 'texture_PCA' databases (thus by keeping only the first 18 attributes of these databases before applying the DFA preprocessing) in order to build the 'satimage_DFA.dat.Z' and 'texture_DFA.dat.Z' database files, having respectively 5 and 10 dimensions (the 'satimage' database having 6 classes and 'texture' 11).

For each subdirectory, the directoryname is the same as the name chosen for the contained database. Here are the directorynames with a brief numerical description of the available databases.

- phoneme French and Spanish phoneme recognition problem. The aim is to distinguish between nasal (AN, IN, ON) and oral (A, I, O, E, E') vowels. 5404 patterns, 5 attributes (the normalized amplitudes of the first five harmonics), 2 classes. This database was in use in the European ESPRIT 5516 project ROARS. The aim of this project is the development and the implementation of a real-time analytical system for French and Spanish phoneme recognition.

- texture The aim is to distinguish between 11 different textures (Grass lawn, Pressed calf leather, Handmade paper, Raffia looped to a high pile, Cotton canvas, ...), each pattern (pixel) being characterised by 40 attributes built by the estimation of fourth order modified moments in four orientations: 0, 45, 90 and 135 degrees. 5500 patterns, 11 classes of 500 instances (each class refers to a type of texture in the Brodatz album). The original source of this database is: P.
Brodatz "Textures: A Photographic Album for Artists and Designers", Dover Publications, Inc., New York, 1966. This database was generated by the Laboratory of Image Processing and Pattern Recognition (INPG-LTIRF Grenoble, France) in the development of the Esprit project ELENA No. 6891 and the Esprit working group ATHOS No. 6620.

- satimage (*) Classification of the multi-spectral values of an image of the Landsat satellite. Each line contains the pixel values in four spectral bands of each of the 9 pixels in a 3x3 neighbourhood and a number indicating the classification label of the central pixel (corresponding to the type of soil: red soil, cotton crop, grey soil, ...). The aim is to predict this classification, given the multi-spectral values. 6435 instances, 36 attributes (4 spectral bands x 9 pixels in neighbourhood), 6 classes. This database was in use in the European StatLog project, which involves comparing the performances of machine learning, statistical, and neural network algorithms on data sets from real-world industrial areas including medicine, finance, image analysis, and engineering design: D. Michie, D.J. Spiegelhalter, and C.C. Taylor, editors. Machine Learning, Neural and Statistical Classification. Ellis Horwood Series In Artificial Intelligence, England, 1994.

- iris (*) This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. 4 attributes (sepal length, sepal width, petal length and petal width).

(*) These databases are taken from the anonymous ftp "UCI Repository Of Machine Learning Databases and Domain Theories" (ics.uci.edu: pub/machine-learning-databases): Murphy, P. M. and Aha, D. W.
(1992). "UCI Repository of machine learning databases" [Machine-readable data repository]. Irvine, CA: University of California, Department of Information and Computer Science.

[Duda73] Duda, R.O. and Hart, P.E., Pattern Classification and Scene Analysis, John Wiley & Sons, 1973.

##############################################################################

The ELENA PROJECT
~~~~~~~~~~~~~~~~

Neural networks are now known as powerful methods for empirical data analysis, especially for approximation (identification, control, prediction) and classification problems. The ELENA project investigates several aspects of classification by neural networks, including links between neural networks and Bayesian statistical classification, incremental learning (control of the network size by adding or removing neurons),...

URL: http://www.dice.ucl.ac.be/neural-nets/ELENA/ELENA.html

ELENA is an ESPRIT III Basic Research Action project (No. 6891). It involves: INPG (Grenoble, F), UPC (Barcelona, E), EPFL (Lausanne, CH), UCL (Louvain-la-Neuve, B), Thomson-Sintra ASM (Sophia Antipolis, F), EERIE (Nimes, F).

The coordinator of the project can be contacted at: Prof. Christian Jutten, INPG-LTIRF, 46 av. Félix Viallet, F-38031 Grenoble Cedex, France Phone: +33 76 57 45 48, Fax: +33 76 57 47 90, e-mail: email@example.com

A simulation environment (PACKLIB) has been developed in the project; it is a smart graphical tool allowing fast programming and interactive analysis. The PACKLIB environment greatly simplifies the user's task by requiring the user to write only the basic code of the algorithms, while the whole graphical input, output and relationship framework is handled by the environment itself. PACKLIB is used for extensive benchmarks in the ELENA project and in other situations (image processing, control of mobile robots,...). Currently, PACKLIB is being tested by beta users and a demo version is available in the public domain.
URL: http://www.dice.ucl.ac.be/neural-nets/ELENA/Packlib.html ############################################################################## IF YOU HAVE ANY PROBLEM, QUESTION OR PROPOSITION, PLEASE E_MAIL the following. VOZ Jean-Luc or Michel Verleysen Universite Catholique de Louvain DICE - Lab. de Microelectronique 3, place du Levant B-1348 LOUVAIN-LA-NEUVE E_mail : firstname.lastname@example.org email@example.com
Multidimensional scaling (from message from F. Murtagh, June 1995): On Statlib (http://lib.stat.cmu.edu/), Fortran or C code, go to S and then to multiv, where a Sammon map program in Fortran is available. Under ripley there should be a better implementation, but maybe more integrated into S (to be checked again...). For Netlib, go to http://www.netlib.org/, then 'The Netlib Repository', then mds for all the 1960s Bell Labs material.

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
  Netlib/MDS is a collection of FREE programs for multidimensional
  scaling and related methods.

  -- NEW: Four entries covering PREFMAP3, SINDSCAL, and KYST2
  -- NEW: Several DOS executable files
  -- Programs may be obtained by email, ftp, and web browser.
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

Netlib/MDS is a collection of programs having to do with multidimensional scaling and related methods, including PREFMAP, SINDSCAL (INDSCAL), ADDTREE, EXTREE, KYST, MDSCAL, HICLUST, and MDPREF (some in multiple versions). Netlib/MDS is one of many libraries (currently about 140) which are maintained at and distributed by Netlib at several sites around the world. For further information, send email containing only this line

    send readme from mds

to firstname.lastname@example.org

Our thanks to Patrick Groenen (Leiden University, The Netherlands), Phipps Arabie (Rutgers University, USA), Jacqueline Meulman (Leiden University, The Netherlands), for providing the new programs, and to Joaquin Sanchez (Complutense University, Spain) for other help.

<>----------------<>----------------<>----------------<>-----------------<>
Joseph B Kruskal, Bell Labs, Lucent Technologies
Room 2C-281, Murray Hill, NJ 07974
EMAIL email@example.com PHONE 908-582-3853 FAX 908-582-2379
HOMEPAGE http://cm.bell-labs.com/cm/ms/departments/sia/kruskal/index.html
<>----------------<>----------------<>----------------<>-----------------<>
Maria Wolters asked: > I'm looking for public domain classification tree induction software. > Our target data is linguistic (letters & part-of-speech tags). the Other Phylogeny Programs web page at our PHYLIP web site lists 88 packages (yes, there are that many!), many of them freely copyable. It also has a link to the Classification Society's list of freely copyable classification software. The URL is: http://evolution.genetics.washington.edu/phylip/software.html -- Joe Felsenstein firstname.lastname@example.org (IP No. 184.108.40.206) Dept. of Genetics, Univ. of Washington, Box 357360, Seattle, WA 98195-7360 USA ------------------------------------------------------------------------------ From: "Ted E. Dunning"
Subject: Re: Classification tree software for symbolic data ... Maria Wolters wants decision tree software ... look at the following http pages http://www.sgi.com/Technology/mlc/trees.html http://www.cs.jhu.edu/~salzberg/announce-oc1.html
Date: Tue, 11 Mar 97 13:44:51 -0800 From: email@example.com To: firstname.lastname@example.org Subject: New model-based clustering software and papers Several new pieces of software and papers on model-based clustering are now available over the Web, produced by the MCLUST project at the University of Washington. They can be accessed from http://www.stat.washington.edu/raftery/Research/Mclust/mclust.html (click on "Papers" or "Software"). The new software is: * mclust-em: 2-dimensional model-based clustering with clutter using the EM algorithm * Principal Curve Clustering Software * Nearest Neighbor Cleaning of Spatial Point Processes The new papers are: * Principal Curve Clustering with Noise. Derek Stanford and Adrian E. Raftery. * Non-parametric Maximum Likelihood Estimation of Features in Spatial Point Processes Using Voronoi Tesselation (revised version). Denis Allard and Chris Fraley * Linear Flaw Detection in Woven Textiles using Model-Based Clustering. John G. Campbell, Chris Fraley, Fionn Murtagh and Adrian E. Raftery. * Algorithms for Model-Based Gaussian Hierarchical Clustering. Chris Fraley. * Nearest Neighbor Clutter Removal for Estimating Features in Spatial Point Processes. Simon Byers and Adrian E. Raftery ------------------------------------------------------------------------------ Date: Thu, 13 Mar 1997 10:04:44 +1300 From: Murray Jorgensen
Yet more model-based clustering software

Emboldened by the announcement of the MCLUST project group at the University of Washington, the MULTIMIX group at the University of Waikato (Lynette Hunt and Murray Jorgensen) announce the availability of the MULTIMIX program, which clusters data having both categorical and continuous variables, possibly containing missing observations. The class of models fitted is described in the (Plain) TeX code which follows and generalizes both Latent Class Analysis and Mixtures of Multivariate Normals. We hope soon to have this software available on our ftp site. If you are interested in downloading this software please send us your email address and we will notify you when the program is available.

Date: Tue, 11 Nov 1997 16:31:14 +1300 From: Murray Jorgensen

I announced earlier this year on this list that the _Multimix_ program would 'shortly' be available. Multimix was written by Lyn Hunt to fit mixture models to multivariate data sets as an alternative to other approaches to cluster analysis (unsupervised learning). I apologise for the delay, but I am pleased to announce that Multimix can now be downloaded from ftp://ftp.math.waikato.ac.nz/pub/maj/

We have decided to make the Fortran 77 source code available so that you will be able to customise Multimix to your own data and platform. The sizes of the multidimensioned arrays used in Multimix are governed by parameter statements which may need to be changed from the supplied values to suit your needs.

For those who are not accustomed to a statistical modelling approach I should make clear that in specifying the model it is important to keep the number of estimated parameters as low as possible consistent with a good fit to the data. Unlike some other approaches Multimix does not attempt to determine an optimal number of clusters. We recommend that you first explore solutions with 2, 3, 4, ... clusters before attempting to go any further.
(I say this because when I requested information about array parameter settings it emerged from several emails that several respondents were seeking what we would regard as quite a large number of clusters.)

Before attempting to fit your own data we recommend that you try to reproduce the output for the Cancer example data and model supplied. The file README.TXT describes the files available in this distribution and I will paste it into this email below as well. Read the paper TALK.DVI/TALK.PS before getting started, then read NOTES.DVI or NOTES.PS for some program documentation. Happy mixture modelling!

Multimix.for contains the program code for fitting a finite mixture of K groups to the data.

[Missing.for] contains a version of Multimix.for which can handle missing values in the variables. [Currently unavailable while minor changes are being made.]

Talk.dvi, Talk.ps Dvi and Postscript versions of a paper presented on 23 August 1996 to the conference ISIS96, Information, Statistics and Induction in Science, held in Melbourne, Australia. [Published in the proceedings of the Conference, edited by D. L. Dowe, K. B. Korb and J. J. Oliver, World Scientific: Singapore]

Notes.ps is a postscript file giving information about the input required to run Multimix. Please read this file.

Read3.for contains program code for setting up a parameter input file for program Multimix. This is useful when setting up the first few runs with a data set. Later it is easier to modify existing files with a text editor.

Flexi This subdirectory contains a Bayesian smoothing program written by Martin Upsdell. It is not connected with Multimix in any way. Read about Flexi in Flexi/Info.txt. Martin's email address is email@example.com.

EXAMPLE OF DATA FILE, INPUT FILE, AND OUTPUT FILES

Cancer11.dat contains the cancer data file.

Cancerdesc.txt A description of the data in Cancer11.dat.

2band.dat contains a parameter input file for the cancer data. A two-component mixture model is to be fitted.
The variables are partitioned into blocks. Each block or 'cell' is assumed independent of the others within each component. In the model fitted by 2band.dat the distributions of the variables in each block are

     1  Univariate Normal
     2  3-category Discrete
     3  2-category Discrete
     4  Trivariate Normal
     5  7-category Discrete
     6  Univariate Normal
     7  Univariate Normal
     8  Univariate Normal
     9  Univariate Normal
    10  2-category Discrete

There is some re-ordering of variables to make the variables in each block contiguous. An initial grouping of the observations into two clusters is specified. Alternatively initial parameter values could have been given.

General.out is the output file generated when using the parameter file 2band.dat.

Groups.out contains the group assignment and the posterior probabilities of assignment to the two groups when using the parameter file 2band.dat.

Queries to Murray Jorgensen.

Murray Jorgensen, Department of Statistics, U of Waikato, Hamilton, NZ -----[+64-7-838-4773]---------------------------[firstname.lastname@example.org]-----
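To make the idea of blocks that are independent within each component more concrete, here is a small Python sketch of how posterior probabilities of group assignment could be computed for a two-component mixture with one univariate normal block and one discrete block. It is not taken from the Multimix Fortran source: the function names, the dictionary layout and all parameter values are hypothetical, and parameter estimation (which Multimix performs) is not shown.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior(observation, components):
    """Posterior probability of each component for one observation.

    observation: (x, category), one continuous value and one category index.
    Within a component the two blocks are assumed independent, so the
    component density is the product of the block densities.
    """
    x, category = observation
    weighted = []
    for comp in components:
        density = normal_pdf(x, comp["mean"], comp["sd"]) * comp["cat_probs"][category]
        weighted.append(comp["weight"] * density)
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical parameter values, chosen only for illustration.
components = [
    {"weight": 0.5, "mean": 0.0, "sd": 1.0, "cat_probs": [0.8, 0.2]},
    {"weight": 0.5, "mean": 3.0, "sd": 1.0, "cat_probs": [0.3, 0.7]},
]
probs = posterior((0.5, 0), components)
```

An observation would then be assigned to the component with the largest posterior probability, which corresponds to the kind of information the announcement says is written to Groups.out.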
Date: Wed, 12 Mar 1997 13:39:13 -0800 (PST) From: Jan Deleeuw
Let me point out once again that for projects like this the Journal of Statistical Software is a nice repository. Statlib is a zoo, without any proper organization. JSS provides peer review, nice formatting, guestbooks for comments, demos when appropriate, code testing by reviewers. Moreover JSS gets hundreds of hits each day. Of course authors maintain copyright, i.e. they can put code in statlib, on their own ftp servers, sell it, whatever, in addition to submitting to JSS. See http://www.stat.ucla.edu/journals/jss/v01/i04/ for a recent clustering example (still partly under construction).
http://astro.u-strasbg.fr/~fmurtagh/mda-sw NEW, August 2002: Java application versions of some of these programs, to be expanded over the coming months.
DOS-based programs from Glenn Milligan at Ohio State University
Département des Sciences biologiques, Université de Montréal The R Package: Multivariate and spatial analyses. Spatial autocorrelation, Mantel tests, many kinds of clustering methods and more! Permute! 3.2: Multiple regression on distance (Mantel test), ultrametric (Double permutation test) and additive (Triple permutation test) matrices.
From: email@example.com, connectionists list, Wed, 12 Nov 97.

Just to let you know of the http availability of new software for Independent Component Analysis (ICA) and Blind Separation of Sources (BSS). The Laboratory for Open Information Systems in the Research Group of Professor S. AMARI, (Brain-Style Information Processing Group) BRAIN SCIENCE INSTITUTE -RIKEN, JAPAN announces the availability of OOLABSS (Object Oriented LAboratory for Blind Source Separation), an experimental laboratory for ICA and BSS. OOLABSS has been developed by Dr. A. CICHOCKI and Dr. B. ORSIER (both worked on the concept of this software and on the development/unification of learning algorithms, while Dr. B. ORSIER designed and implemented the software in C++ under Windows95/NT). OOLABSS offers an interactive environment for experiments with a very wide family of recently developed on-line adaptive learning algorithms for Blind Separation of Sources and Independent Component Analysis. OOLABSS is free for non-commercial use. The current version is still experimental but is reasonably stable and robust.

The program has the following features:

1. Users can define their own activation functions for each neuron (processing unit) or use a global activation function (e.g. hyperbolic tangent) for all neurons.

2. The program also enables automatic (self-adaptive) selection of quasi optimal activation functions (time variable or switching) depending on the stochastic distribution of extracted source signals (so called extended ICA problem).

3. Users can add noise both to sensor signals and to synaptic weights.

4. The number of sources, sensors and outputs of the neural network can be arbitrarily defined by users.

5.
In the case where the number of source signals is completely unknown, one of the proposed approaches makes it possible not only to estimate the source signals but also to estimate their number correctly on-line without any pre-processing, like pre-whitening or Principal Component Analysis (PCA).

6. The problem of optimal updating of a learning rate (step) is a key problem encountered in a wide class of on-line adaptive learning algorithms. Relying on properties of nonlinear low-pass filters, a family of learning algorithms for self-adaptive (automatic) updating of learning rates (a global one, or local-individual for each synaptic weight) is implemented in the program. The learning rates can be self-adaptive, i.e. quasi optimal annealing of learning rates is automatically provided in a stationary environment. In a non-stationary environment the learning rates adaptively change their value to provide good tracking abilities. The users can also define their own function for changing the learning rate.

7. The program makes it possible to compare the performance of several different algorithms.

8. Special emphasis is given to algorithms that are robust with respect to noise and outliers and that have the equivariant feature (i.e. independence of asymptotic performance for ill conditioning of the mixing process).

9. Advanced graphics: illustrative figures are produced and can be easily printed. Encapsulated Postscript files can be produced for easy integration into word processors. Data can be pasted to the clipboard for post-processing using specialized software like Matlab or even spreadsheets.

10. Users can easily enter their own data (sensor signals, or sources and mixing matrix, noise, a neural network model, etc.) in order to experiment with various kinds of algorithms.

11. Modular programming style: the program code is based on well-defined C++ classes and is very modular, which makes it possible to tailor the software to each user's specific needs.
Please visit OOLABSS home page at URL: http://www.bip.riken.go.jp/absl/orsier/OOLABSS The version is 1.0 beta, so comments, suggestions and bug reports are welcome at the address: firstname.lastname@example.org
FastICA, a new MATLAB package for independent component analysis, is now available at: http://www.cis.hut.fi/projects/ica/fastica/ FastICA is a public-domain package that implements the fast fixed-point algorithm for ICA, and features an easy-to-use graphical user interface. The fixed-point algorithm is a computationally highly efficient method for ICA: in independent experiments it has been found to be 10-100 times faster than conventional gradient descent methods for ICA. Another advantage of the fixed-point algorithm is that it can be used to perform projection pursuit, estimating the independent components one-by-one. Aapo Hyvarinen on behalf of the FastICA Team at the Helsinki University of Technology email@example.com
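The fixed-point iteration mentioned in the FastICA announcement above can be illustrated with a small sketch. This is not code from the FastICA package (which is written for MATLAB); it is a minimal pure-Python illustration of the kurtosis-based one-unit update, w <- E[x (w.x)^3] - 3w followed by normalization, under the assumption that the data are already whitened. The function name `one_unit_fastica`, the toy sources and the 30-degree rotation used for mixing are all hypothetical choices made for this example.

```python
import math
import random

def one_unit_fastica(samples, w, max_iter=100, tol=1e-10):
    """Estimate one independent component from whitened 2-D samples.

    Uses the kurtosis-based fixed-point update
        w <- E[x (w.x)^3] - 3 w,  followed by  w <- w / ||w||.
    Assumes the samples have (approximately) identity covariance.
    """
    n = len(samples)
    for _ in range(max_iter):
        # Sample average of x * (w.x)^3 over all observations.
        acc = [0.0, 0.0]
        for x in samples:
            u = w[0] * x[0] + w[1] * x[1]
            acc[0] += x[0] * u ** 3
            acc[1] += x[1] * u ** 3
        new = [acc[0] / n - 3.0 * w[0], acc[1] / n - 3.0 * w[1]]
        norm = math.hypot(new[0], new[1])
        new = [new[0] / norm, new[1] / norm]
        # Converged when the direction stops changing (the sign is arbitrary).
        done = abs(new[0] * w[0] + new[1] * w[1]) > 1.0 - tol
        w = new
        if done:
            break
    return w

# Toy data (an assumption of this sketch): two independent unit-variance
# uniform sources mixed by a rotation, so the mixtures stay roughly white.
rng = random.Random(0)
half_width = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has variance 1
theta = math.radians(30.0)
c, s = math.cos(theta), math.sin(theta)
samples = []
for _ in range(5000):
    s1 = rng.uniform(-half_width, half_width)
    s2 = rng.uniform(-half_width, half_width)
    samples.append((c * s1 - s * s2, s * s1 + c * s2))

w = one_unit_fastica(samples, [1.0, 0.0])
# |cosine| of the angle between w and the first column of the mixing matrix.
alignment = abs(w[0] * c + w[1] * s)
print(round(alignment, 3))
```

Estimating components one at a time in this way is what makes the method usable for projection pursuit; real data would first need centering and whitening, which this sketch deliberately skips.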
Documentation, DOS32, W3.1, W95, OS/2 executables, source code, test data and results, GIF images of trees output. Rand index. All available at http://220.127.116.11/clopt.
From N. Sriram, swknasri@LEONIS.NUS.EDU.SG
Please note that RASA version 2.2, software for measuring phylogenetic signal and for data analysis, has been uploaded and can be downloaded as a binhexed, self-extracting archive from http://test1.bio.psu.edu/LW/list.htm and by anonymous ftp at loco.biology.unr.edu (pub) (rasa). This software (for Mac only) corrects a serious bug in the power&effect analysis option in RASA 2.1. Please pass this announcement along to any user you may know who might not be a subscriber to CLASS-L.
RASA 2.2 also offers two null hypothesis formulations: the original, analytical, equiprobable null and a new permutation null that provides a better fit of the test to the Student t distribution.
Other features include:
- a tool for the detection of otherwise cryptic long edge taxa, which cause inconsistency of methods of tree-building. This tool (the taxon variance plot) was recently described in Lyons-Weiler, J., and G.A. Hoelzer. 1997. Escaping from the Felsenstein Zone by detecting long branches in phylogenetic data. Molecular Phylogenetics and Evolution 8:375-384.
- a test for the suitability of outgroup taxa for rooting trees, to be described in Lyons-Weiler, J., G.A. Hoelzer and R.J. Tausch. 1998. Optimal Outgroup Analysis. Biological Journal of the Linnean Society (in press).
- new experimental treatments of phylogenetic data, including a type of waveform analysis that reveals structure in biological sequences.
The software is menu-driven, with the following options (some not yet activated):
2.1 File Menu
  2.1.1 Open
  2.1.2 Open Several
  2.1.3 Open Results
  2.1.4 Close
  2.1.5 Save Results As
  2.1.6 Export Modified Matrix
  2.1.7 Print
  2.1.8 Quit
2.2 Analysis Menu
  2.2.1 Signal Content
  2.2.2 SC Recursive
  2.2.3 Optimal Outgroup Analysis
  2.2.4 Power and Effect
  2.2.5 Colonization/Extinction Ratio
  2.2.6 Character Compatibility
  2.2.7 Signal Waveform
2.3 Graphs
  2.3.1 Regression
  2.3.2 Taxon Variance Plot
  2.3.3 Show Signal Waveform
  2.3.4 Residual Plots
  2.3.5 RASA Table
  2.3.6 Show Data Matrix
2.4 Data
  2.4.1 Include/Exclude Taxa
  2.4.2 Define Outgroup Taxa
  2.4.3 Remove Invariant Characters
  2.4.4 Remove APPARENT Autapomorphies
  2.4.5 Recode Purines and Pyrimidines
  2.4.6 Create Combined Data Matrix
  2.4.7 Delete Noisy Characters
2.5 Windows
  2.5.1 Clear the Screen
  2.5.2 Main Display
  2.5.3 Help
  2.5.4 References
  2.5.5 Acknowledgements
  2.5.6 Close All
Please send questions to firstname.lastname@example.org
Message Date: Thu, 19 Feb 1998 16:38:18 -0800 From: James Francis Lyons-Weiler weiler@ERS.UNR.EDU
Update message, Fri, 25 Sep 1998, James Lyons-Weiler
The ILK (Induction of Linguistic Knowledge) Research Group at Tilburg University, The Netherlands, announces the release of a new version of TiMBL, Tilburg Memory Based Learner (version 2.0). TiMBL is a machine learning program implementing a family of Memory-Based Learning techniques. TiMBL stores a representation of the training set explicitly in memory (hence `Memory Based'), and classifies new cases by extrapolating from the most similar stored cases.
TiMBL features the following (optional) metrics and speed-up optimizations that enhance the underlying k-nearest neighbor classifier engine:
- Information Gain weighting for dealing with features of differing importance (the IB1-IG learning algorithm).
- Stanfill & Waltz's / Cost & Salzberg's (Modified) Value Difference metric for making graded guesses of the match between two different symbolic values.
- Conversion of the flat instance memory into a decision tree, and inverted indexing of the instance memory, both yielding faster classification.
- Further compression and pruning of the decision tree, guided by feature information gain differences, for an even larger speed-up (the IGTREE learning algorithm).
The current version is a complete rewrite of the software, and offers a number of new features:
- Support for numeric features.
- The TRIBL algorithm, a hybrid between decision tree and nearest neighbor search.
- An API to access the functionality of TiMBL from your own C++ programs.
- Increased ability to monitor the process of extrapolation from nearest neighbors.
- Many bug-fixes and small improvements.
TiMBL accepts commandline arguments by which these metrics and optimizations can be selected and combined. TiMBL can read the C4.5 and WEKA's ARFF data file formats as well as column files and compact (fixed-width delimiter-less) data. You are invited to download the TiMBL package for educational or non-commercial research purposes.
When downloading the package you are asked to register, and express your agreement with the license terms. TiMBL is *not* shareware or public domain software. If you have registered for version 1.0, please be so kind as to re-register for the current version. The TiMBL software package can be downloaded from http://ilk.kub.nl/software.html or by following the `Software' link under the ILK home page at http://ilk.kub.nl/ .
The TiMBL package contains the following:
- Source code (C++) with a Makefile.
- A reference guide containing descriptions of the incorporated algorithms, detailed descriptions of the commandline options, and a brief hands-on tutorial.
- Some example datasets.
- The text of the licence agreement.
- A postscript version of the paper that describes IGTREE.
The package should be easy to install on most UNIX systems.
Background: Memory-based learning (MBL) has proven to be quite successful in a large number of tasks in Natural Language Processing (NLP) -- MBL of NLP tasks (text-to-speech, part-of-speech tagging, chunking, light parsing) is the main theme of research of the ILK group. At one point it was decided to build a well-coded and generic tool that would combine the group's algorithms, favorite optimization tricks, and interface desiderata. The current incarnation of this is now version 2.0 of TiMBL. We think TiMBL can make a useful tool for NLP research, and, for that matter, for any other domain in machine learning.
For information on the ILK Research Group, visit our site at http://ilk.kub.nl/ On this site you can find links to (postscript versions of) publications relating to the algorithms incorporated in TiMBL and on their application to NLP tasks. The reference guide ("TiMBL: Tilburg Memory-Based Learner, version 2.0, Reference Guide.", Walter Daelemans, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch. ILK Technical Report 99-01) can be downloaded separately and directly from http://ilk.kub.nl/~ilk/papers/ilk9901.ps.gz
For comments and bug reports relating to TiMBL, please send mail to Timbl@kub.nl
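The Information Gain weighting idea (IB1-IG) described in the TiMBL announcement above can be sketched as follows. This is an illustrative pure-Python sketch, not TiMBL's C++ API or its actual implementation: the function names (`entropy`, `information_gain`, `classify`) and the toy data are hypothetical, only k = 1 is handled, and plain information gain is used as the feature weight in a weighted overlap distance over symbolic features.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(instances, labels, feature):
    """Reduction in class entropy obtained by knowing one feature's value."""
    by_value = {}
    for inst, label in zip(instances, labels):
        by_value.setdefault(inst[feature], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder

def classify(instances, labels, weights, query):
    """1-nearest-neighbour with a weighted overlap distance:
    every mismatching feature adds that feature's weight to the distance."""
    def distance(inst):
        return sum(w for w, a, b in zip(weights, inst, query) if a != b)
    best = min(range(len(instances)), key=lambda i: distance(instances[i]))
    return labels[best]

# Hypothetical toy data: the first feature determines the class,
# the second feature is uninformative.
train = [("a", "red"), ("a", "blue"), ("b", "red"), ("b", "blue")]
classes = ["X", "X", "Y", "Y"]
weights = [information_gain(train, classes, f) for f in range(2)]
prediction = classify(train, classes, weights, ("b", "green"))
print(weights, prediction)
```

In this toy set the first feature receives all of the weight and the second none, so an unseen value of the second feature does not affect the nearest-neighbour match.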
The UCI KDD Archive
The UC Irvine Knowledge Discovery in Databases (KDD) Archive is a new online repository (http://kdd.ics.uci.edu/) of large datasets which encompasses a wide variety of data types, analysis tasks, and application areas. The primary role of this repository is to serve as a benchmark testbed to enable researchers in knowledge discovery and data mining to scale existing and future data analysis algorithms to very large and complex data sets. This archive is supported by the Information and Data Management Program at the National Science Foundation, and is intended to expand the current UCI Machine Learning Database Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html) to datasets that are orders of magnitude larger and more complex.
We are seeking submissions of large, well-documented datasets that can be made publicly available. Data types and tasks of interest include, but are not limited to:
  Data Types       Tasks
  multivariate     classification
  time series      regression
  sequential       clustering
  relational       density estimation
  text/web         retrieval
  image            causal modeling
  spatial          visualization
  multimedia       discovery
  transactional    exploratory data analysis
  heterogeneous    data cleaning
  sound/audio      recommendation systems
Submission Guidelines: Please see the UCI KDD Archive web site for detailed instructions.
Stephen Bay (email@example.com) librarian
R is a public statistical package with many utilities of use to statisticians. The main R master site is http://www.ci.tuwien.ac.at/R/ and a US mirror site is http://cran.stat.wisc.edu/
Date: Tue, 6 Jul 1999 11:40:14 -0500 From: Chong Gu
Dear fellow R users, I just uploaded a new package gss to ftp.ci.tuwien.ac.at. The package name gss stands for General Smoothing Spline. In the current version (0.4-1), it handles nonparametric multivariate regression with Gaussian, Binomial, Poisson, Gamma, Inverse Gaussian, and Negative Binomial responses. I am still working on code for density estimation and hazard rate estimation to be made available in future releases.
On the modeling side, gss uses tensor-product smoothing splines to construct nonparametric ANOVA structures using cubic spline, linear spline, and thin-plate spline marginals. The popular (main-effect-only) additive models are special cases of nonparametric ANOVA models. The syntax of gss functions resembles that of the lm and glm suites.
Among new features that are not available from other spline packages are the standard errors needed for the construction of Wahba's Bayesian confidence intervals for smoothing spline fits, so you may want to try out gss even if you only want to calculate a univariate cubic spline or a single-term thin-plate spline. For those familiar with smoothing splines, gss is a front end to RKPACK, which encodes O(n^3) generic algorithms for reproducing kernel based smoothing spline calculation.
Reports on bugs and suggestions for improvements/new features are most welcome.
Chong Gu
I am pleased to announce a major new release of the Bayes Net Toolbox, a software package for Matlab 5 that supports inference and learning in directed graphical models. Specifically, it supports exact and approximate inference, discrete and continuous variables, static and dynamic networks, and parameter and structure learning. Hence it can handle a large number of popular statistical models, such as the following: PCA/factor analysis, logistic regression, hierarchical mixtures of experts, QMR, DBNs, factorial HMMs, switching Kalman filters, etc. For more details, and to download the software, please go to http://www.cs.berkeley.edu/~murphyk/Bayes/bnt.html The new version (2.0) has been completely rewritten, making it much easier to read, use and extend. It is also somewhat faster. The main change is that I now make extensive use of objects. (I used to use structs, and a dispatch mechanism based on the type-tag system in Abelson and Sussman.) In addition, each inference algorithm (junction tree, sampling, loopy belief propagation, etc.) is now an object. This makes the code and documentation much more modular. It also makes it easier to add special-case algorithms, and to combine algorithms in novel ways (e.g., combining sampling and exact inference). I have gone to great lengths to make the source code readable, so it should prove an invaluable teaching tool. In addition, I am hoping that people will contribute algorithms to the toolbox, in the spirit of the open source movement. Kevin Murphy
I would very much appreciate it if you could add a link to my fuzzy clustering algorithms on the web (www.fuzzy-clustering.de). The fc package (UNIX, C++, GPL licensed) comes with a number of fuzzy clustering algorithms and tools for data manipulation and visualization.
Dipl.-Inform. Frank Hoeppner, University of Applied Sciences OOW, Constantiaplatz 4, D-26723 Emden. e-mail firstname.lastname@example.org www http://www.fuzzy-clustering.de
Latent class analysis package. Dr. Jay Magidson (Statistical Innovations) will be giving a workshop on Latent Class Analysis at the CSNA Annual Meeting, St Louis MO, 2001. Further information, including links to a free software download and tutorial.
Comprehensive site, The Three Mode Company, including: bibliographies, software, data sets, addresses of active three-mode researchers, news about three-mode activities.
Web address of The Three-Mode Company, three-mode.leidenuniv.nl
P.M. Kroonenberg, Department of Education, Leiden University Wassenaarseweg 52, 2333 AK Leiden, The Netherlands. Tel. *31-71-527 3446; Fax *31-71-527 3945 kroonenb at fswrul.fsw.leidenuniv.nl
From: Balazs Kegl
I updated my Principal Curves web page and moved it to http://www.iro.umontreal.ca/~kegl/research/pcurves/ Recent references are included, and a new version of the java implementation of the Polygonal Line Algorithm [1,2] is available. The most important new features are
- arbitrary-dimensional input data
- loading/downloading your own data and saving the results
- adjusting the parameters of the algorithm in an interactive fashion
[1] B. Kegl, A. Krzyzak, T. Linder, and K. Zeger, "Learning and design of principal curves," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 3, pp. 281-297, 2000. http://www.iro.umontreal.ca/~kegl/research/publications/keglKrzyzakLinderZeger99.ps
[2] B. Kegl, "Principal curves: learning, design, and applications," Ph. D. Thesis, Concordia University, Canada, 1999. http://www.iro.umontreal.ca/~kegl/research/publications/thesis.ps
Balazs Kegl
Balazs Kegl, Assistant Professor
Dept. of Computer Science and Op. Res.
University of Montreal
CP 6128 succ. Centre-Ville
Montreal, Canada H3C 3J7
E-mail: email@example.com
Phone: (514) 343-7401
Fax: (514) 343-5834
http://www.iro.umontreal.ca/~kegl/
A new version of SVM-Light (V5.00) is available, as well as my dissertation "Learning to Classify Text using Support Vector Machines", which recently appeared with Kluwer. The new version can be downloaded from http://svmlight.joachims.org/
SVM-Light is an implementation of Support Vector Machines (SVMs) for large-scale problems. The new features of this version are the following:
- Learning of ranking functions (e.g. for search engines), in addition to classification and regression.
- Bug fixes and improved numerical stability.
The dissertation describes the algorithms and methods implemented in SVM-Light. In particular, it shows how these methods can be used for text classification. Links are on my homepage http://www.joachims.org/
Cheers
Thorsten
---
Thorsten Joachims, Assistant Professor, Department of Computer Science, Cornell University http://www.joachims.org/
From: Andrzej CICHOCKI Date: Fri, 23 Aug 2002
We would like to announce the availability of software packages called ICALAB for ICA (Independent Component Analysis), BSS (Blind Source Separation) and BSE (Blind Signal Extraction). ICALAB for Signal Processing and ICALAB for Image Processing are two independent packages for MATLAB that implement a number of efficient algorithms for ICA employing HOS (higher order statistics), BSS employing SOS (second order statistics) and LTP (linear temporal prediction), and BSE employing various SOS and HOS methods. After some data preprocessing, these packages can also be used for MICA (multidimensional independent component analysis) and NIBSS (non independent blind source separation). The main features of both packages are an easy-to-use graphical user interface, and implementation of computationally powerful and efficient algorithms. Some implemented algorithms are robust with respect to additive white noise. The packages are available on our web pages: http://www.bsp.brain.riken.go.jp/ICALAB Any critical comments and suggestions are welcome.
Best regards, Andrzej Cichocki
From: Radford Neal To: firstname.lastname@example.org CC: Radford Neal Subject: New software release / Dirichlet diffusion trees Date: Mon, 30 Jun 2003 11:38:50 -0400
Announcing a new release of my SOFTWARE FOR FLEXIBLE BAYESIAN MODELING
Features include:
* Regression and classification models based on neural networks and Gaussian processes
* Density modeling and clustering methods based on finite and infinite (Dirichlet process) mixtures and on Dirichlet diffusion trees
* Inference for a variety of simple Bayesian models specified using BUGS-like formulas
* A variety of Markov chain Monte Carlo methods, for use with the above models, and for evaluation of MCMC methodologies
Dirichlet diffusion tree models are a new feature in this release. These models utilize a new family of prior distributions over distributions that is more flexible and realistic than Dirichlet process, Dirichlet process mixture, and Polya tree priors. These models are suitable for general density modeling tasks, and also provide a Bayesian method for hierarchical clustering. See the following references:
Neal, R. M. (2003) "Density modeling and clustering using Dirichlet diffusion trees", to appear in Bayesian Statistics 7.
Neal, R. M. (2001) "Defining priors for distributions using Dirichlet diffusion trees", Technical Report No. 0104, Dept. of Statistics, University of Toronto, 25 pages. Available at http://www.cs.utoronto.ca/~radford/dft-paper1.abstract.html
The software is written in C for Unix and Linux systems. It is free, and may be downloaded from http://www.cs.utoronto.ca/~radford/fbm.software.html
Radford M. Neal email@example.com
From: Avi Pfeffer To: firstname.lastname@example.org Subject: Announcing IBAL release Date: Tue, 01 Jul 2003 11:04:38 -0400 Readers of this list may be interested in the following announcement: I am pleased to announce the initial release of IBAL, a general purpose language for probabilistic reasoning. IBAL is highly expressive, and its inference algorithm generalizes many common frameworks as well as allowing many new ones. It also provides parameter estimation and decision making. All this is packaged in a programming language that provides libraries, automatic type checking, etc. IBAL may be downloaded from http://www.eecs.harvard.edu/~avi/IBAL. Avi Pfeffer
Author and contact point: Fionn Murtagh, fmurtagh @ astro.u-strasbg.fr