Convert raw dataset to D3M dataset

The raw dataset converter requires Python 3.6 or later and an installed core d3m package. It currently supports the following data types:

Text
Video
Image
Time-Series
Tabular
Audio
Graph

Note: Some task and data types may not be fully automated (e.g., object detection, graph problems). The TRAIN and TEST hierarchies specific to D3M datasets will still be created. However, datasetDoc.json may need to be customized to link resources/tables for the specific task. Example datasets are provided for reference.
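Before customizing datasetDoc.json, it can help to see which resources the converter has already registered. The following is a minimal sketch, assuming the file follows the usual D3M layout with a top-level `dataResources` list whose entries carry `resID`, `resType` and `resPath` keys; `list_resources` is a hypothetical helper, not part of autonml or d3m.

```python
import json


def list_resources(dataset_doc_path):
    """Return (resID, resType, resPath) for each entry under dataResources.

    Assumes the standard D3M datasetDoc.json layout; missing keys come back as None.
    """
    with open(dataset_doc_path) as handle:
        doc = json.load(handle)
    return [
        (res.get("resID"), res.get("resType"), res.get("resPath"))
        for res in doc.get("dataResources", [])
    ]
```

For a generated dataset, this could be called on raw/TRAIN/dataset_TRAIN/datasetDoc.json and the result compared against the matching example dataset.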

Interface

Command Line

create_d3m_dataset /path/to/train_data /path/to/test_data label metric -t task [-t ...] [-o /path/to/output_folder]

Python API

from autonml import createD3mDataset

training_data_csv = 'path/to/train/data'
testing_data_csv = 'path/to/test/data'
output_dir = 'path/to/formatted/data/output/directory'
label = 'Target'
metric = 'f1Macro'
tasks = ['classification']

createD3mDataset(training_data_csv,
                 testing_data_csv,
                 output_dir,
                 label,
                 metric,
                 tasks)

Usage and Explanation

Both the CLI command and the createD3mDataset API take six arguments:

Input Training Data : Path to input training data. Can be a file or a directory, depending on input data type. See Input data types below.
Input Testing Data : Path to input testing data. Can be a file or a directory, depending on input data type. See Input data types below.
Output directory : Path to the output directory for storing the converted D3M dataset. The directory is created if it does not exist, but its parent directory must already exist.
Label : Name of the column that contains the targets. This must be consistent between training and testing data.
Metric : Metric for evaluation. See Metrics below.
Tasks : A list of tags that define the task type for AutonML. See Task types below.
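Two of the constraints above (the label column must appear in both inputs, and the parent of the output directory must exist) can be checked before running the conversion. The following is a minimal sketch, assuming tabular CSV inputs with a header row; `preflight_check` is a hypothetical helper, not part of autonml.

```python
import csv
from pathlib import Path


def preflight_check(train_csv, test_csv, output_dir, label):
    """Return a list of problems found; an empty list means the inputs look usable.

    Hypothetical helper for CSV inputs only; not part of autonml.
    """
    problems = []

    # The label column must be present in both training and testing data.
    for name, path in (("training", train_csv), ("testing", test_csv)):
        with open(path, newline="") as handle:
            header = next(csv.reader(handle), [])
        if label not in header:
            problems.append(f"label '{label}' not found in {name} data header")

    # The output directory may be missing, but its parent must already exist.
    parent = Path(output_dir).resolve().parent
    if not parent.is_dir():
        problems.append(f"parent directory does not exist: {parent}")

    return problems
```

Directory-based inputs (for example image or video data) would need a different check, since this sketch only reads a CSV header.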

Input data types

For detailed instructions on creating D3M datasets from image data, see the image documentation.
For video data, see the video documentation.
For forecasting data, see the forecasting documentation.

Text D3M datasets are created in much the same way as image datasets. Refer to the image documentation above as a template for text datasets.

Sample D3M datasets for different data types:

audio: 31_urbansound_MIN_METADATA
video: LL1_VID_UCF11_MIN_METADATA
text: LL1_TXT_CLS_airline_opinion_MIN_METADATA
timeseries: 66_chlorineConcentration_MIN_METADATA
image: 22_handgeometry_MIN_METADATA
graph: 59_umls_MIN_METADATA <https://datasets.datadrivendiscovery.org/d3m/datasets/-/tree/master/seed_datasets_current/59_umls_MIN_METADATA>

Metrics

Valid metrics for different task types:

classification / linkPrediction / graphMatching / vertexNomination / vertexClassification: accuracy, f1Macro, f1Micro, rocAuc, rocAucMacro, rocAucMicro
regression / forecasting / collaborativeFiltering: rSquared, meanSquaredError, meanAbsoluteError
communityDetection / clustering: normalizedMutualInformation
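The pairing of task types and metrics above can be encoded as a lookup table and checked before a conversion is started. This is a minimal sketch that transcribes the lists above; `VALID_METRICS` and `is_valid_metric` are hypothetical names, not part of autonml, and task types with no metrics listed here are simply treated as having none.

```python
# Metric groups transcribed from the documentation above.
_CLASSIFICATION_METRICS = frozenset(
    {"accuracy", "f1Macro", "f1Micro", "rocAuc", "rocAucMacro", "rocAucMicro"}
)
_REGRESSION_METRICS = frozenset({"rSquared", "meanSquaredError", "meanAbsoluteError"})
_CLUSTERING_METRICS = frozenset({"normalizedMutualInformation"})

# Map each documented task type to its metric group.
VALID_METRICS = {}
for _task in ("classification", "linkPrediction", "graphMatching",
              "vertexNomination", "vertexClassification"):
    VALID_METRICS[_task] = _CLASSIFICATION_METRICS
for _task in ("regression", "forecasting", "collaborativeFiltering"):
    VALID_METRICS[_task] = _REGRESSION_METRICS
for _task in ("communityDetection", "clustering"):
    VALID_METRICS[_task] = _CLUSTERING_METRICS


def is_valid_metric(task, metric):
    """Return True if `metric` is listed for `task`; False for unlisted tasks."""
    return metric in VALID_METRICS.get(task, frozenset())
```

For example, `is_valid_metric("classification", "f1Macro")` is true, while `is_valid_metric("regression", "accuracy")` is false.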

Task types

Valid task types:

classification
regression
forecasting
collaborativeFiltering
communityDetection
graphMatching
linkPrediction
vertexClassification
vertexNomination
clustering
objectDetection
semiSupervised
remoteSensing

Sample D3M datasets for different task types:

classification: 185_baseball_MIN_METADATA
regression: 196_autoMpg_MIN_METADATA
forecasting: LL1_736_stock_market_MIN_METADATA
collaborativeFiltering: 60_jester_MIN_METADATA
communityDetection: 6_70_com_amazon_MIN_METADATA
graphMatching: 49_facebook_MIN_METADATA
linkPrediction: 59_umls_MIN_METADATA
vertexClassification: LL1_VTXC_1343_cora_MIN_METADATA

Examples

Some examples of valid commands:

create_d3m_dataset train_data.csv test_data.csv output_dir Label accuracy -t classification
create_d3m_dataset train_data.csv test_data.csv output_dir Value meanSquaredError -t regression

Output

This script creates a directory named “raw” that contains your dataset in D3M format. Use this directory as the input to ./scripts/start_container.sh.

This is the structure created for a generated D3M dataset:

raw$ tree
.
├── TEST
│   ├── dataset_TEST
│   │   ├── datasetDoc.json
│   │   ├── metadata.json
│   │   └── tables
│   │       └── learningData.csv
│   └── problem_TEST
│       └── problemDoc.json
└── TRAIN
    ├── dataset_TRAIN
    │   ├── datasetDoc.json
    │   ├── metadata.json
    │   └── tables
    │       └── learningData.csv
    └── problem_TRAIN
        └── problemDoc.json

8 directories, 8 files
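A quick way to confirm that a conversion produced the layout shown above is to check for each of the eight files. The following is a minimal sketch; `EXPECTED_FILES` and `missing_files` are hypothetical names, not part of autonml, and the check covers only the files in the listing above.

```python
from pathlib import Path

# Relative paths taken from the tree listing above.
EXPECTED_FILES = [
    f"{split}/dataset_{split}/{name}"
    for split in ("TRAIN", "TEST")
    for name in ("datasetDoc.json", "metadata.json", "tables/learningData.csv")
] + [f"{split}/problem_{split}/problemDoc.json" for split in ("TRAIN", "TEST")]


def missing_files(root):
    """Return the expected files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files("raw")` returns an empty list when every file from the listing is present.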