LLaMA-Factory/data
hiyouga a9d1fb72f7 refactor dataset_attr, add eos in pt, fix #757 2023-09-01 19:00:45 +08:00
belle_multiturn add belle multiturn dataset 2023-06-16 20:01:16 +08:00
example_dataset restore from git lfs 2023-08-01 16:33:25 +08:00
hh_rlhf_en Initial commit 2023-05-28 18:09:04 +08:00
ultra_chat Initial commit 2023-05-28 18:09:04 +08:00
README.md refactor dataset_attr, add eos in pt, fix #757 2023-09-01 19:00:45 +08:00
README_zh.md refactor dataset_attr, add eos in pt, fix #757 2023-09-01 19:00:45 +08:00
alpaca_data_en_52k.json restore from git lfs 2023-08-01 16:33:25 +08:00
alpaca_data_zh_51k.json restore from git lfs 2023-08-01 16:33:25 +08:00
alpaca_gpt4_data_en.json restore from git lfs 2023-08-01 16:33:25 +08:00
alpaca_gpt4_data_zh.json restore from git lfs 2023-08-01 16:33:25 +08:00
comparison_gpt4_data_en.json restore from git lfs 2023-08-01 16:33:25 +08:00
comparison_gpt4_data_zh.json restore from git lfs 2023-08-01 16:33:25 +08:00
dataset_info.json refactor dataset_attr, add eos in pt, fix #757 2023-09-01 19:00:45 +08:00
lima.json restore from git lfs 2023-08-01 16:33:25 +08:00
oaast_rm.json restore from git lfs 2023-08-01 16:33:25 +08:00
oaast_rm_zh.json restore from git lfs 2023-08-01 16:33:25 +08:00
oaast_sft.json restore from git lfs 2023-08-01 16:33:25 +08:00
oaast_sft_zh.json restore from git lfs 2023-08-01 16:33:25 +08:00
self_cognition.json restore from git lfs 2023-08-01 16:33:25 +08:00
sharegpt_zh_27k.json restore from git lfs 2023-08-01 16:33:25 +08:00
wiki_demo.txt add pre-training script 2023-05-29 21:37:22 +08:00

README.md

If you are using a custom dataset, please provide your dataset definition in dataset_info.json in the following format.

"dataset_name": {
  "hf_hub_url": "the name of the dataset repository on the Hugging Face Hub. (if specified, the 3 arguments below are ignored)",
  "script_url": "the name of the directory containing a dataset loading script. (if specified, the 2 arguments below are ignored)",
  "file_name": "the name of the dataset file in this directory. (required if the above are not specified)",
  "file_sha1": "the SHA-1 hash value of the dataset file. (optional)",
  "ranking": "whether the examples contain ranked responses or not. (default: false)",
  "columns": {
    "prompt": "the name of the column in the dataset containing the prompts. (default: instruction)",
    "query": "the name of the column in the dataset containing the queries. (default: input)",
    "response": "the name of the column in the dataset containing the responses. (default: output)",
    "history": "the name of the column in the dataset containing the chat history. (default: None)"
  }
}

where the prompt and response columns must contain non-empty values. The query column is concatenated with the prompt column to form the model input. The history column should contain a list in which each element is a pair of strings representing one query-response turn.
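
The column rules above can be illustrated with a minimal Python sketch. It is not the project's loading code: the helper name `build_example`, the newline separator used to join prompt and query, and the sample record are all assumptions made for illustration. Only the default column names come from the format description.

```python
from typing import Any, Dict, List, Optional, Tuple

# Default column names, taken from the format description above.
DEFAULT_COLUMNS: Dict[str, Optional[str]] = {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "history": None,
}


def build_example(
    record: Dict[str, Any],
    columns: Optional[Dict[str, Optional[str]]] = None,
    sep: str = "\n",  # assumption: the real separator may differ
) -> Tuple[str, Any, List[Tuple[str, str]]]:
    """Map one raw record to (model_input, response, history).

    Hypothetical helper for illustration only.
    """
    cols = {**DEFAULT_COLUMNS, **(columns or {})}

    prompt = record[cols["prompt"]]
    response = record[cols["response"]]
    if not prompt or not response:
        raise ValueError("prompt and response columns must be non-empty")

    # The query column is optional; when present it is appended to the prompt.
    query = record.get(cols["query"]) if cols["query"] else None
    model_input = prompt + sep + query if query else prompt

    # Each history element is a (query, response) pair of strings.
    history: List[Tuple[str, str]] = []
    if cols["history"]:
        history = [tuple(pair) for pair in record.get(cols["history"]) or []]

    return model_input, response, history


# Made-up sample record using the default column names plus a history column.
record = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
    "history": [["Hi", "Hello! How can I help?"]],
}
model_input, response, history = build_example(record, columns={"history": "history"})
```

Because the default for the history column is None, the sketch only reads chat history when the "columns" mapping names a history column explicitly.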

For datasets used in reward modeling or DPO training, the response column should be a list of strings, with the preferred answer appearing first, for example:

{
  "instruction": "Question",
  "input": "",
  "output": [
    "Chosen answer",
    "Rejected answer"
  ]
}
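
As a rough sketch of how such a record might be checked before training, the Python snippet below validates the response list and splits it into a preferred and a rejected answer. The function name `split_ranked_responses` is hypothetical and not part of the project; the only rule taken from the text above is that the response column is a list of strings with the preferred answer first.

```python
from typing import Any, Dict, Tuple


def split_ranked_responses(
    example: Dict[str, Any],
    response_column: str = "output",  # default response column name
) -> Tuple[str, str]:
    """Return (chosen, rejected) from a ranked example.

    Hypothetical helper: assumes the preferred answer is listed first.
    """
    responses = example.get(response_column)
    if not isinstance(responses, list) or len(responses) < 2:
        raise ValueError("ranked examples need a list of at least two responses")
    if not all(isinstance(r, str) and r for r in responses):
        raise ValueError("every response must be a non-empty string")
    return responses[0], responses[1]


example = {
    "instruction": "Question",
    "input": "",
    "output": ["Chosen answer", "Rejected answer"],
}
chosen, rejected = split_ranked_responses(example)
```

A record whose response column holds a single string, as in ordinary supervised fine-tuning data, would be rejected by this check.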