Convert raw dataset to D3M dataset
The raw dataset converter requires Python 3.6 or later and the core d3m package
to be installed. It currently supports the following data types:
Text
Video
Image
Time-Series
Tabular
Audio
Graph
Note: Some task and data types may not be fully automated (e.g., object detection, graph problems). The TRAIN and TEST hierarchies specific to D3M datasets will still be created, but datasetDoc.json might need to be customized to link resources/tables for the specific task. Example datasets are provided for reference.
Interface
Command Line
create_d3m_dataset /path/to/train_data /path/to/test_data label metric -t task [-t ...] [-o /path/to/output_folder]
Python API
from autonml import createD3mDataset
training_data_csv = 'path/to/train/data'
testing_data_csv = 'path/to/test/data'
output_dir = 'path/to/formatted/data/output/directory'
label = 'Target'
metric = 'f1Macro'
tasks = ['classification']
createD3mDataset(training_data_csv,
                 testing_data_csv,
                 output_dir,
                 label,
                 metric,
                 tasks)
Usage and Explanation
Both the CLI command and the createD3mDataset API require 6 arguments:
Input Training Data : Path to input training data. Can be a file or a directory, depending on input data type. See Input data types below.
Input Testing Data : Path to input testing data. Can be a file or a directory, depending on input data type. See Input data types below.
Output directory : Path to output directory for storing the converted D3M dataset. Creates the directory if it does not exist, but the parent directory must exist.
Label : Name of the column that holds the targets. This must be consistent between training and testing data.
Metric : Metric for evaluation. See Metrics below.
Tasks : A list of tags that define the task type for AutonML. See Task types below.
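The constraints in the list above can be checked before running the conversion. The following is a minimal sketch for tabular CSV input only; `check_conversion_inputs` is a hypothetical helper, not part of autonml, and it assumes the label column appears in the header row of both files.

```python
import csv
from pathlib import Path


def check_conversion_inputs(train_csv, test_csv, output_dir, label):
    """Return a list of problems found; an empty list means the inputs look usable.

    Hypothetical helper for illustration; not part of the autonml package.
    """
    problems = []
    for name, path in (("training", train_csv), ("testing", test_csv)):
        path = Path(path)
        if not path.is_file():
            problems.append(f"{name} data not found: {path}")
            continue
        # Read only the header row to confirm the label column is present.
        with path.open(newline="") as handle:
            header = next(csv.reader(handle), [])
        if label not in header:
            problems.append(f"label {label!r} missing from {name} data")
    # The output directory may be absent, but its parent must already exist.
    if not Path(output_dir).resolve().parent.is_dir():
        problems.append(f"parent of output directory does not exist: {output_dir}")
    return problems
```

An empty return value suggests the paths and label are consistent with the requirements above; it does not validate the metric or task tags.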
Input data types
See the detailed documentation on D3M dataset creation for each data type:
image: see the image documentation
video: see the video documentation
forecasting: see the forecasting documentation
Creation of text D3M datasets follows a methodology similar to that for image datasets;
refer to the image documentation above as a template for text datasets.
Sample D3M datasets for different data types:
audio: 31_urbansound_MIN_METADATA
video: LL1_VID_UCF11_MIN_METADATA
timeseries: 66_chlorineConcentration_MIN_METADATA
image: 22_handgeometry_MIN_METADATA
graph: 59_umls_MIN_METADATA <https://datasets.datadrivendiscovery.org/d3m/datasets/-/tree/master/seed_datasets_current/59_umls_MIN_METADATA>
Metrics
- Valid metrics for different task types:
classification / linkPrediction / graphMatching / vertexNomination / vertexClassification: accuracy, f1Macro, f1Micro, rocAuc, rocAucMacro, rocAucMicro
regression / forecasting / collaborativeFiltering: rSquared, meanSquaredError, meanAbsoluteError
communityDetection / clustering: normalizedMutualInformation
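The pairing of metrics and task types listed above can be expressed as a lookup table, which makes it easy to catch a mismatched metric before starting a conversion. This is a sketch only: `METRIC_GROUPS`, `VALID_METRICS` and `is_valid_metric` are hypothetical names, and treating a metric as acceptable when any one of the supplied task tags lists it is an assumption, not documented converter behavior.

```python
# Valid metrics per group of task types, transcribed from the list above.
METRIC_GROUPS = [
    (
        ("classification", "linkPrediction", "graphMatching",
         "vertexNomination", "vertexClassification"),
        ("accuracy", "f1Macro", "f1Micro", "rocAuc", "rocAucMacro", "rocAucMicro"),
    ),
    (
        ("regression", "forecasting", "collaborativeFiltering"),
        ("rSquared", "meanSquaredError", "meanAbsoluteError"),
    ),
    (
        ("communityDetection", "clustering"),
        ("normalizedMutualInformation",),
    ),
]

# Flatten the groups into a task -> metrics lookup table.
VALID_METRICS = {
    task: metrics for tasks, metrics in METRIC_GROUPS for task in tasks
}


def is_valid_metric(metric, tasks):
    """Return True if at least one task tag lists `metric` as valid.

    Assumption: one matching tag is enough; the converter may be stricter.
    """
    return any(metric in VALID_METRICS.get(task, ()) for task in tasks)
```

For example, `is_valid_metric("f1Macro", ["classification"])` is true, while the same metric with `["regression"]` is not.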
Task types
- Valid task types:
classification
regression
forecasting
collaborativeFiltering
communityDetection
graphMatching
linkPrediction
vertexClassification
vertexNomination
clustering
objectDetection
semiSupervised
remoteSensing
Sample D3M datasets for different task types:
classification: 185_baseball_MIN_METADATA
regression: 196_autoMpg_MIN_METADATA
forecasting: LL1_736_stock_market_MIN_METADATA
collaborativeFiltering: 60_jester_MIN_METADATA
communityDetection: 6_70_com_amazon_MIN_METADATA
graphMatching: 49_facebook_MIN_METADATA
linkPrediction: 59_umls_MIN_METADATA
vertexClassification: LL1_VTXC_1343_cora_MIN_METADATA
Examples
Some examples of valid commands:
create_d3m_dataset train_data.csv test_data.csv output_dir Label accuracy -t classification
create_d3m_dataset train_data.csv test_data.csv output_dir Value meanSquaredError -t regression
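When the command is launched from a script, it can help to assemble the argument list programmatically, with one -t flag per task tag. The sketch below mirrors the argument order of the example commands above; `build_command` is a hypothetical helper, and the order is taken from those examples rather than from the tool's own help output.

```python
import shlex


def build_command(train, test, output_dir, label, metric, tasks):
    """Assemble the argument list in the order used by the example commands.

    Hypothetical helper for illustration; not part of the autonml package.
    """
    command = ["create_d3m_dataset", str(train), str(test),
               str(output_dir), label, metric]
    for task in tasks:
        # Each task tag gets its own -t flag.
        command += ["-t", task]
    return command


command = build_command("train_data.csv", "test_data.csv", "output_dir",
                        "Label", "accuracy", ["classification"])
# Quote each part so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed directly to `subprocess.run(command, check=True)`, which avoids shell quoting problems for paths containing spaces.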
Output
This script creates a directory named “raw” containing your dataset in D3M format. Use this directory as input to ./scripts/start_container.sh
This is the structure created for a generated D3M dataset:
raw$ tree
.
├── TEST
│   ├── dataset_TEST
│   │   ├── datasetDoc.json
│   │   ├── metadata.json
│   │   └── tables
│   │       └── learningData.csv
│   └── problem_TEST
│       └── problemDoc.json
└── TRAIN
    ├── dataset_TRAIN
    │   ├── datasetDoc.json
    │   ├── metadata.json
    │   └── tables
    │       └── learningData.csv
    └── problem_TRAIN
        └── problemDoc.json
8 directories, 8 files
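After a conversion, the layout shown above can be checked with a few lines of Python. This is a minimal sketch for the tabular case; `expected_files` and `missing_files` are hypothetical helpers, and the caller is assumed to pass the path of the generated "raw" directory itself.

```python
from pathlib import Path


def expected_files(split):
    """Relative paths expected for one split (TRAIN or TEST), per the tree above."""
    return [
        f"{split}/dataset_{split}/datasetDoc.json",
        f"{split}/dataset_{split}/metadata.json",
        f"{split}/dataset_{split}/tables/learningData.csv",
        f"{split}/problem_{split}/problemDoc.json",
    ]


def missing_files(raw_dir):
    """Return the expected files that are absent under raw_dir."""
    raw_dir = Path(raw_dir)
    return [
        rel
        for split in ("TRAIN", "TEST")
        for rel in expected_files(split)
        if not (raw_dir / rel).is_file()
    ]
```

An empty list from `missing_files` indicates that all eight files from the tree are present; datasets for other data types may contain additional resources that this check does not cover.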