Recipes Dataset
The RecipeNLG dataset is available for download here. It contains 2.2 million recipes. The size is slightly less than 1 GB.
Download and Unpack the Dataset
- Go to the download page https://recipenlg.cs.put.poznan.pl/dataset.
 - Accept Terms and Conditions and download zip file.
 - Option: Using the 
md5sum dataset.zipto validate the zip file and it should be equal to3a168dfd0912bb034225619b3586ce76. - Unpack the zip file with 
unzip dataset.zip. You will get thefull_dataset.csvfile in thedatasetdirectory. 
Create a Table
Run clickhouse-client and execute the following CREATE query:
CREATE TABLE recipes
(
    title String,
    ingredients Array(String),
    directions Array(String),
    link String,
    source LowCardinality(String),
    NER Array(String)
) ENGINE = MergeTree ORDER BY title;
Insert the Data
Run the following command:
clickhouse-client --query "
    INSERT INTO recipes
    SELECT
        title,
        JSONExtract(ingredients, 'Array(String)'),
        JSONExtract(directions, 'Array(String)'),
        link,
        source,
        JSONExtract(NER, 'Array(String)')
    FROM input('num UInt32, title String, ingredients String, directions String, link String, source LowCardinality(String), NER String')
    FORMAT CSVWithNames
" --input_format_with_names_use_header 0 --format_csv_allow_single_quote 0 --input_format_allow_errors_num 10 < full_dataset.csv
This is a showcase how to parse custom CSV, as it requires multiple tunes.
Explanation:
- The dataset is in CSV format, but it requires some preprocessing on insertion; we use table function input to perform preprocessing;
 - The structure of CSV file is specified in the argument of the table function 
input; - The field 
num(row number) is unneeded - we parse it from file and ignore; - We use 
FORMAT CSVWithNamesbut the header in CSV will be ignored (by command line parameter--input_format_with_names_use_header 0), because the header does not contain the name for the first field; - File is using only double quotes to enclose CSV strings; some strings are not enclosed in double quotes, and single quote must not be parsed as the string enclosing - that's why we also add the 
--format_csv_allow_single_quote 0parameter; - Some strings from CSV cannot parse, because they contain 
\M/sequence at the beginning of the value; the only value starting with backslash in CSV can be\Nthat is parsed as SQL NULL. We add--input_format_allow_errors_num 10parameter and up to ten malformed records can be skipped; - There are arrays for ingredients, directions and NER fields; these arrays are represented in unusual form: they are serialized into string as JSON and then placed in CSV - we parse them as String and then use JSONExtract function to transform it to Array.
 
Validate the Inserted Data
By checking the row count:
Query:
SELECT count() FROM recipes;
Result:
┌─count()─┐
│ 2231142 │
└─────────┘
Example Queries
Top Components by the Number of Recipes:
In this example we learn how to use arrayJoin function to expand an array into a set of rows.
Query:
SELECT
    arrayJoin(NER) AS k,
    count() AS c
FROM recipes
GROUP BY k
ORDER BY c DESC
LIMIT 50
Result:
┌─k────────────────────┬──────c─┐
│ salt                 │ 890741 │
│ sugar                │ 620027 │
│ butter               │ 493823 │
│ flour                │ 466110 │
│ eggs                 │ 401276 │
│ onion                │ 372469 │
│ garlic               │ 358364 │
│ milk                 │ 346769 │
│ water                │ 326092 │
│ vanilla              │ 270381 │
│ olive oil            │ 197877 │
│ pepper               │ 179305 │
│ brown sugar          │ 174447 │
│ tomatoes             │ 163933 │
│ egg                  │ 160507 │
│ baking powder        │ 148277 │
│ lemon juice          │ 146414 │
│ Salt                 │ 122558 │
│ cinnamon             │ 117927 │
│ sour cream           │ 116682 │
│ cream cheese         │ 114423 │
│ margarine            │ 112742 │
│ celery               │ 112676 │
│ baking soda          │ 110690 │
│ parsley              │ 102151 │
│ chicken              │ 101505 │
│ onions               │  98903 │
│ vegetable oil        │  91395 │
│ oil                  │  85600 │
│ mayonnaise           │  84822 │
│ pecans               │  79741 │
│ nuts                 │  78471 │
│ potatoes             │  75820 │
│ carrots              │  75458 │
│ pineapple            │  74345 │
│ soy sauce            │  70355 │
│ black pepper         │  69064 │
│ thyme                │  68429 │