IBM Parallel Machine Learning Toolbox

A toolbox for running machine learning algorithms on parallel computing platforms. Large data sets are common in Web applications, bioinformatics, and speech and image processing, and many sophisticated machine learning algorithms cannot process that much data on a single node. IBM Parallel Machine Learning Toolbox (PML) can, by distributing the required computations across computing nodes in a parallel fashion. This distribution expedites training by several orders of magnitude: for example, from several weeks on a single node to days or even hours on multiple nodes.

PML contains many commonly used machine learning algorithms and includes an API for incorporating additional ones. Standard supported algorithms include the following:

- Classification: support-vector machines (SVM), linear least squares, and transform regression
- Clustering: k-means and fuzzy k-means
- Feature reduction: Principal Component Analysis (PCA) and kernel PCA

The toolbox can work on various types of architecture, from multi-core machines to BlueGene, and runs on Windows, Linux, and UNIX. For more information, visit the PML web site.

How does it work?

PML can be used in two modes:

- The built-in algorithms can be run through a simple textual interface. Users specify the location of the data, then select the algorithm and parameters to run; the output is written to a text file.
- New algorithms can be added by making use of a simple API. This mode makes it possible for researchers to test their own algorithms while relying on PML to distribute the computations in an easy, reliable way (a hypothetical sketch of such a plug-in contract appears below).

PML uses the standard MPICH2 library for low-level communications. Building on this library means that PML can run on widely varying types of architecture, such as single-node machines, small clusters, grids, and BlueGene. Once initialized, the toolbox distributes computations to the computing nodes, which return their partial results to a master node; the master conducts the necessary updates and sends the updated results back to the computing nodes. This process repeats until the results converge according to the pre-specified parameters (see the MPI sketch at the end of this section).
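PML's actual plug-in API is not reproduced in this document, so the following is only a hypothetical C++ sketch of the kind of contract such a framework imposes on a distributable algorithm: compute local statistics on each node's data shard, merge the partial statistics, and update the model until convergence. Every name here (DistributableAlgorithm, localStats, merge, update, MeanEstimator) is invented for illustration and is not PML's API.

    // Hypothetical sketch only; these names are illustrative, not PML's API.
    #include <cstdio>
    #include <vector>

    struct DistributableAlgorithm {
        virtual ~DistributableAlgorithm() = default;
        // Run on every computing node over its local data shard.
        virtual std::vector<double> localStats(const std::vector<double>& shard) = 0;
        // Run on the master: fold one node's statistics into an accumulator.
        virtual void merge(std::vector<double>& acc, const std::vector<double>& s) = 0;
        // Run on the master: update the model; return true once converged.
        virtual bool update(const std::vector<double>& acc) = 0;
    };

    // Toy algorithm expressed in that contract: estimate a global mean.
    struct MeanEstimator : DistributableAlgorithm {
        double mean = 0.0;
        std::vector<double> localStats(const std::vector<double>& shard) override {
            double sum = 0.0;
            for (double x : shard) sum += x;
            return {sum, static_cast<double>(shard.size())};  // {sum, count}
        }
        void merge(std::vector<double>& acc, const std::vector<double>& s) override {
            acc[0] += s[0];
            acc[1] += s[1];
        }
        bool update(const std::vector<double>& acc) override {
            mean = acc[1] > 0.0 ? acc[0] / acc[1] : 0.0;
            return true;  // a single pass suffices for a mean
        }
    };

    int main() {
        // Simulate three nodes' data shards in one process; in PML the
        // framework would run localStats on separate computing nodes.
        std::vector<std::vector<double>> shards = {{1, 2}, {3, 4}, {5}};
        MeanEstimator algo;
        std::vector<double> acc(2, 0.0);
        for (const auto& shard : shards)
            algo.merge(acc, algo.localStats(shard));
        algo.update(acc);
        std::printf("mean = %f\n", algo.mean);  // prints 3.0
        return 0;
    }

The serial driver at the bottom stands in for the framework: in PML the local computations would run on separate nodes and the merge would happen over the network rather than in a loop.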
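To make the master/worker loop concrete, here is a minimal, self-contained sketch (not PML's source code) of a distributed k-means iteration written directly against the MPI primitives that MPICH2 provides. The data shard, initial centers, and tolerance are invented; only the communication pattern (broadcast the model, compute per-node partial statistics, reduce them to the master, update, re-broadcast) mirrors the process described above.

    // Hypothetical sketch, not PML source: a distributed k-means loop over MPI.
    // Each rank holds a local data shard; the master (rank 0) owns the model.
    #include <mpi.h>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int k = 2;            // number of clusters (illustrative)
        const int max_iters = 50;   // iteration cap (illustrative)
        const double tol = 1e-6;    // convergence tolerance (illustrative)

        // Invented local shard of 1-D points; PML would read real data instead.
        std::vector<double> shard;
        for (int i = 0; i < 100; ++i)
            shard.push_back(rank % 2 + 0.001 * i);

        // Master picks initial centers; every node receives them.
        std::vector<double> centers(k, 0.0);
        if (rank == 0) { centers[0] = 0.0; centers[1] = 1.0; }
        MPI_Bcast(centers.data(), k, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        for (int iter = 0; iter < max_iters; ++iter) {
            // Each computing node builds partial statistics over its shard.
            std::vector<double> sum(k, 0.0), cnt(k, 0.0);
            std::vector<double> gsum(k, 0.0), gcnt(k, 0.0);
            for (double x : shard) {
                int best = 0;
                for (int c = 1; c < k; ++c)
                    if (std::fabs(x - centers[c]) < std::fabs(x - centers[best]))
                        best = c;
                sum[best] += x;
                cnt[best] += 1.0;
            }

            // Partial results are returned to the master node ...
            MPI_Reduce(sum.data(), gsum.data(), k, MPI_DOUBLE, MPI_SUM, 0,
                       MPI_COMM_WORLD);
            MPI_Reduce(cnt.data(), gcnt.data(), k, MPI_DOUBLE, MPI_SUM, 0,
                       MPI_COMM_WORLD);

            // ... which conducts the necessary updates and checks convergence.
            int done = 0;
            if (rank == 0) {
                double shift = 0.0;
                for (int c = 0; c < k; ++c)
                    if (gcnt[c] > 0.0) {
                        double updated = gsum[c] / gcnt[c];
                        shift += std::fabs(updated - centers[c]);
                        centers[c] = updated;
                    }
                done = (shift < tol) ? 1 : 0;
            }

            // Updated results go back to the computing nodes; repeat until done.
            MPI_Bcast(centers.data(), k, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            MPI_Bcast(&done, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (done) break;
        }

        if (rank == 0)
            std::printf("final centers: %f %f\n", centers[0], centers[1]);
        MPI_Finalize();
        return 0;
    }

Compile with an MPICH2-style wrapper (mpicxx) and launch with, for example, mpiexec -n 4. Using an explicit MPI_Reduce to the master followed by MPI_Bcast, rather than a single MPI_Allreduce, keeps the sketch close to the master-node description above at the cost of one extra communication step per iteration.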