CMU TA2 (Built using DARPA D3M ecosystem)
Auton ML is an automated machine learning system developed by CMU Auton Lab to power data scientists with efficient model discovery and advanced data analytics. Auton ML also powers the D3M Subject Matter Expert (SME) User Interfaces such as Two Ravens http://2ra.vn/.
Taking your machine learning capacity to the nth power.
We provide a documentation listing the complete set of tasks, data modalities, machine learning models and future supported tasks provided by AutonML here.
Installation
AutonML can be installed as: pip install autonml. We recommend this
installation be done in a new virtual environment or conda environment, using python 3.8
Recommended steps to install autonml:
conda create -n <yourenvnamehere> python=3.8
conda activate <yourenvnamehere>
pip install autonml
pip install d3m-common-primitives d3m-sklearn-wrap sri-d3m rpi-d3m-primitives dsbox-primitives dsbox-corex distil-primitives autonbox d3m-jhu-primitives kf-d3m-primitives d3m-esrnn d3m-nbeats --no-binary pmdarima
This installation may take time to complete, owing to the fact that
pip’s dependecy resolvers may take time resolving potential package
conflicts. To make installation faster, you can add pip’s legacy
resolver as --use-deprecated=legacy-resolver. Caution: using old
resolvers may present unresolved package conflicts.
D3M dataset
Any dataset to be used should be in D3M dataset format (directory structure with TRAIN, TEST folders and underlying .json files).
Example available of a single dataset here
More datasets available here
Any non-D3M data can be converted to D3M dataset. (See section below on “Convert raw dataset to D3M dataset”).
Run the AutonML pipeline
We can run the AutonML pipeline in two ways.
Command Line Interface
AutonML can be run from the command line via the autonml_main command. This command takes five arguments, listed below:
Run type (fit/fit-produce/produce)
Path to the data directory (must be in D3M format)
Output directory where results are to be stored. This directory will be dynamically created if it does not exist.
Timeout (measured in minutes)
Number of CPUs to be used (minimum: 4 cores, recommended: 8 cores)
RUN_TYPE=fit-produce
INPUT_DIR=/home/<user>/d3m/datasets/185_baseball_MIN_METADATA
OUTPUT_DIR=/output
TIMEOUT=2
NUMCPUS=8
autonml_main ${RUN_TYPE} ${INPUT_DIR} ${OUTPUT_DIR} ${TIMEOUT} ${NUMCPUS}
API
AutonML can also be accessed via a simple Python API, as shown in the command below.
from autonml import AutonML
aml = AutonML(input_dir='/path/to/input/d3mdataset',
output_dir='/path/to/store/results',
timeout=10,
numcpus=8)
aml.run()
Outputs
Running AutonML using either of the above scripts will do the following-
Run search for best pipelines for the specified dataset using
TRAINdata.JSON pipelines (with ranks) will be output in JSON format at
/output/pipelines_ranked/.CSV prediction files for each pipeline trained on
TRAINdata and predicted onTESTdata will be available at/output/predictions/Training data predictions for each pipeline are produced and saved at
/output/training_predictions/<pipeliname>_train_predictions.csv.Python code equivalent of executing each pipeline on the dataset is saved at
/output/executables/. Also includes Jupyter Notebook equivalent of running each pipeline on the dataset.
An example -
OUTPUT_DIR=output
python ${OUTPUT_DIR}/99211bc3-638a-455b-8d48-0dadc0bf1f10/executables/19908fd3-706a-48da-b13c-dc13da0ed3cc.code.py ${OUTPUT_DIR}/ ${OUTPUT_DIR}/99211bc3-638a-455b-8d48-0dadc0bf1f10/predictions/19908fd3-706a-48da-b13c-dc13da0ed3cc.predictions.csv
Examples
You can find example notebooks for various supported datasets here.
.