It has it all: RESTful web service, machine learning, data mining, and Google. See the comments too. The news is about a month old, but I had not seen it posted yet.
The recent announcement of the Google Prediction API caught my attention. The service is interesting in that its business model focuses on providing a scalable machine learning black box that can be used directly or integrated into an application. The service appears to work by the user first uploading a dataset to the (also newly announced) Google Storage service, training an opaque model on the data, and deriving predictions from the trained model.
One interacts with the service through a RESTful API, performing HTTP POST and GET operations to invoke the training and prediction functions. Data must be provided in CSV format as comma-separated, line-based records. Training appears to be per data bucket, and it is unclear whether models can be updated once trained, whether the models themselves can be retrieved, or even which machine learning algorithms and algorithm parameters will be used. At this stage the service claims support only for supervised classification tasks.
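To make the interaction pattern concrete, here is a minimal sketch of the train/poll/predict round trip as I read it from the service page. The base URL, endpoint paths, auth header, and payload shapes are all my own guesses (the API is not yet published), so treat this as illustration only:

```python
# A minimal sketch of the train/poll/predict round trip. The base URL,
# endpoint paths, auth header, and payload shapes below are my own guesses
# (hypothetical) distilled from the service description, not a published API.
import requests

BASE = "https://www.googleapis.com/prediction/v1"        # assumed base URL
DATA = "mybucket/mydata.csv"                             # CSV already in Google Storage
HEADERS = {"Authorization": "GoogleLogin auth=<token>"}  # auth scheme assumed

# Kick off training against the uploaded bucket object.
resp = requests.post(f"{BASE}/training?data={DATA}", headers=HEADERS)
resp.raise_for_status()

# Poll training status; the service reportedly returns a cross-validated
# accuracy estimate once the model is built.
status = requests.get(f"{BASE}/training/{DATA}", headers=HEADERS).json()
print(status)

# Query a prediction for a single record: pass the input attributes
# (one CSV row minus the label) and read back the predicted class.
query = {"input": ["4 free pills", 0.7]}                 # guessed payload shape
pred = requests.post(f"{BASE}/query/{DATA}", json=query, headers=HEADERS)
print(pred.json())
```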
Data types are limited for now to categorical prediction (classification) with real-valued and textual inputs. Naturally (this is Google) data records can comprise very long lists of attributes and dataset sizes can be enormous. The status of model training can be queried, and some basic statistics from the trained model can be retrieved: a classification accuracy estimated using cross-validation on the provided training data. Predictions are made through a query interface by passing in the input attributes and retrieving the classification. Presumably there is a batch mode where multiple records can be passed in for classification.
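As an aside, the cross-validated accuracy the service reportedly returns is straightforward to picture. Here is a self-contained sketch of k-fold cross-validation using a deliberately trivial majority-class "learner" (my own illustration, nothing to do with the service's internals):

```python
# A small illustration of k-fold cross-validation, the procedure the service
# reportedly uses to estimate classification accuracy on the training data.
# The "learner" here is a trivial majority-class predictor; any model with a
# train/predict interface would slot in the same way.
from collections import Counter

def cross_val_accuracy(labels, k=5):
    """Estimate accuracy by averaging over k held-out folds."""
    n = len(labels)
    fold_accs = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))   # every k-th record held out
        train = [labels[i] for i in range(n) if i not in test_idx]
        # "Train": memorise the majority class of the training fold.
        majority = Counter(train).most_common(1)[0][0]
        # "Predict" the majority class for each held-out record and score it.
        hits = sum(1 for i in test_idx if labels[i] == majority)
        fold_accs.append(hits / len(test_idx))
    return sum(fold_accs) / k

labels = ["spam", "ham", "ham", "spam", "ham",
          "ham", "ham", "spam", "ham", "ham"]  # made-up training labels
print(f"estimated accuracy: {cross_val_accuracy(labels):.2f}")
```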
All this information was distilled from the service page and is more than likely to change. The service is not yet available, but I have signed up to the waiting list for early access to help burn it in. Billing will likely come down to some combination of storage size, perhaps model compute time, and the volume of predictions retrieved.
To me it feels like they have abstracted the process used to build the language translation service or spell checker/corrector, simplified it, and are turning it into a commodity. Big data, rather than fancy algorithms, is 'where it is at' (see Norvig's Theorizing from Data talk from 2007).
The service is loosely related to two other services out in the wild. The first is TunedIT, an algorithm/dataset/challenge website launched in September 2009 (see the press release). The site allows the uploading of datasets and/or algorithms and, more importantly, the design of dataset challenges like the Netflix Prize. This seems to be the site's primary function, and to me it is trying to exploit the success of the Netflix Prize by abstracting it and offering the management of such challenges as a service (not a terrible idea). The other site is MLcomp, launched in April 2010 (see the press release), which focuses on users either uploading datasets to find the algorithm that performs best, or uploading algorithms and having the system automatically evaluate them against all previously uploaded datasets. To me, it feels like an online version of the WEKA machine learning workbench (not a terrible idea if your market is other grad students). Both sites are really aimed at machine learning practitioners and, unlike the announced Google service, don't seem to offer a useful way to exploit the algorithms for private data sources.
I had some similar ideas while studying as a graduate student, although with loftier scientific ambitions: automatically mapping the performance of a large suite of function optimization algorithms rather than function approximation (machine learning) algorithms - something like an optimization version of MLcomp. I even blogged a little about it after I completed my dissertation (see Mapping 'no free lunch'). The targeted value proposition of the Google prediction service is an excellent approach, and people may even pay to use it.
Although the algorithm hackers and researchers will want to know all about the algorithms and their parameterization, I hope that the service remains a black box (shock, horror!). Maybe not, but I would hate to see this devolve into an algorithm free-for-all that would confuse users and muddy the value the service could deliver. With Google-level infrastructure, they can run a suite of the top 20 techniques for a given problem type and deliver the best (or an ensemble of them) to produce the predictions, keeping the specific details of the magic behind the model a secret. That is what I would do. And if this is indeed the adopted strategy, then I doubt we will see a "download model" API call anytime soon.
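To sketch what that strategy could look like (a toy of my own, assuming nothing about Google's internals): evaluate a suite of candidate models on held-out data, then serve either the single best or a majority-vote ensemble.

```python
# A toy illustration of "run a suite of techniques, serve the best (or an
# ensemble)". The three candidate "models" are deliberately trivial rules;
# real candidates would be trained classifiers sharing a predict interface.
from collections import Counter

def length_rule(x): return "spam" if len(x) > 3 else "ham"
def digit_rule(x):  return "spam" if any(c.isdigit() for c in x) else "ham"
def baseline(x):    return "ham"

candidates = {"length-rule": length_rule,
              "digit-rule": digit_rule,
              "baseline": baseline}

# Held-out validation data (made up for the example).
validation = [("win $100", "spam"), ("hi", "ham"), ("lunch?", "ham"),
              ("4 free pills", "spam"), ("ok", "ham")]

def accuracy(model):
    """Fraction of held-out records the model labels correctly."""
    return sum(model(x) == y for x, y in validation) / len(validation)

# Model selection: keep the candidate with the best held-out accuracy.
best_name, best_model = max(candidates.items(), key=lambda kv: accuracy(kv[1]))
print("serving:", best_name, "with accuracy:", accuracy(best_model))

# Or an ensemble: majority vote across all candidates.
def ensemble(x):
    votes = Counter(model(x) for model in candidates.values())
    return votes.most_common(1)[0][0]

print("ensemble on 'free v1agra':", ensemble("free v1agra"))
```

Serving only the winner (or the vote) keeps the algorithm details hidden while still letting the provider swap techniques in and out behind the API.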