|
78 | 78 | "```\n", |
79 | 79 | "The `HeadlineId` variable is special because it is a _key_ that links a particular headline to its words (a 1:n relation).\n", |
80 | 80 | "\n", |
81 | | - "*Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only for pedagogical purporses.*\n", |
| 81 | + "*Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only used for pedagogical purporses.*\n", |
82 | 82 | "\n", |
83 | | - "To train a classifier with Khiops in this multi-table setup, this schema must be codified in the dictionary file. Let's check the contents of the `HeadlineSarcasm` dictionary file:" |
| 83 | + "To train a classifier with Khiops in this multi-table setup, this schema must be coded in a dictionary file. Let's check the contents of the `HeadlineSarcasm` dictionary file:" |
84 | 84 | ] |
85 | 85 | }, |
86 | 86 | { |
|
101 | 101 | "metadata": {}, |
102 | 102 | "source": [ |
103 | 103 | "As in the single-table case the `.kdic`file describes the schema for both tables, but note the following differences:\n", |
104 | | - "- The dictionary for the table `Headline` is prefixed by the `Root` keyword to indicate that is the main one.\n", |
105 | | - "- For both tables, their dictionary names are followed by `(HeadlineId)` to indicate that `HeadlineId` is the key of these tables.\n", |
106 | | - "- The schema for the main table contains an extra special variable defined with the statement `Table(Words) HeadlineWords`. This is, in addition to sharing the same key variable, is necessary to indicate the `1:n` relationship between the main and secondary table.\n", |
| 104 | + "- The dictionary for the table `Headline` is prefixed by the `Root` keyword. It is here optional and simply tags the main dictionary `Headline` representing the statistical instances.\n", |
| 105 | + "- For both tables, dictionary names are followed by `(HeadlineId)` to indicate that `HeadlineId` is their key.\n", |
| 106 | + "- The schema of the main table contains an extra special variable defined with the statement `Table(Words) HeadlineWords`. This is, in addition to sharing the same key variable, necessary to indicate the `1:n` relationship between the main and secondary table.\n", |
107 | 107 | "\n", |
108 | | - "Now let's store the location main and secondary tables and peek their contents:" |
| 108 | + "Now let's store the location of the main and secondary tables and peek their contents:" |
109 | 109 | ] |
110 | 110 | }, |
111 | 111 | { |
|
117 | 117 | "sarcasm_headlines_file = os.path.join(\"data\", \"HeadlineSarcasm\", \"Headlines.txt\")\n", |
118 | 118 | "sarcasm_words_file = os.path.join(\"data\", \"HeadlineSarcasm\", \"HeadlineWords.txt\")\n", |
119 | 119 | "\n", |
120 | | - "print(f\"HeadlineSarcasm main table file: {sarcasm_headlines_file}\")\n", |
| 120 | + "print(f\"HeadlineSarcasm main table file location: {sarcasm_headlines_file}\")\n", |
121 | 121 | "print(\"\")\n", |
122 | 122 | "peek(sarcasm_headlines_file, n=3)\n", |
123 | 123 | "\n", |
|
133 | 133 | "The call to the `train_predictor` will be very similar to the single-table case but there are some differences. \n", |
134 | 134 | "\n", |
135 | 135 | "The first is that we must pass the path of the extra secondary data table. This is done with the `additional_data_tables` parameter that is a Python dictionary containing key-value pairs for each table. More precisely:\n", |
136 | | - "- keys describe *data paths* of secondary tables. In this case only ``Headline`HeadlineWords``\n", |
137 | | - "- values describe the *file paths* of secondary tables. In this case only the file path we stored in `sarcasm_words_file`\n", |
| 136 | + "- keys describe *data paths* of secondary tables. In this case only, it is ``HeadlineWords``\n", |
| 137 | + "- values describe the *file paths* of secondary tables. In this case only, it is the file path we stored in `sarcasm_words_file`\n", |
138 | 138 | "\n", |
139 | | - "*Note: For understanding what data paths are see the \"Multi-Table Tasks\" section of the Khiops `core.api` documentation*\n", |
| 139 | + "*Note: To understand what data paths are, please check the \"Multi-Table Tasks\" section of the Khiops `core.api` documentation*\n", |
140 | 140 | "\n", |
141 | | - "Secondly, we specify how many features/aggregates Khiops will create with its multi-table AutoML mode. For the `HeadlineSarcasm` dataset Khiops can create features such as:\n", |
| 141 | + "Secondly, we must specify how many features/aggregates Khiops will create (at most) with its multi-table AutoML mode. For the `HeadlineSarcasm` dataset Khiops can create features such as:\n", |
142 | 142 | "- *Number of different words in the headline* \n", |
143 | 143 | "- *Most common word in the headline before the third one*\n", |
144 | 144 | "- *Number of times the word 'the' appears*\n", |
145 | 145 | "- ...\n", |
146 | 146 | "\n", |
147 | 147 | "It will then evaluate, select and combine the created features to build a classifier. We'll ask to create `1000` of these features (the default is `100`).\n", |
148 | 148 | "\n", |
149 | | - "With these considerations, let's setup the some extra variables and train the classifier:" |
| 149 | + "With these considerations, let's now train the classifier:" |
150 | 150 | ] |
151 | 151 | }, |
152 | 152 | { |
|
155 | 155 | "metadata": {}, |
156 | 156 | "outputs": [], |
157 | 157 | "source": [ |
158 | | - "sarcasm_results_dir = os.path.join(\"exercises\", \"HeadlineSarcasm\")\n", |
| 158 | + "analysis_report_file_path_Sarcasm = os.path.join(\n", |
| 159 | + " \"exercises\", \"HeadlineSarcasm\", \"AnalysisReport.khj\"\n", |
| 160 | + ")\n", |
159 | 161 | "\n", |
160 | 162 | "sarcasm_report, sarcasm_model_kdic = kh.train_predictor(\n", |
161 | 163 | " sarcasm_kdic,\n", |
162 | 164 | " dictionary_name=\"Headline\", # This must be the main/root dictionary\n", |
163 | 165 | " data_table_path=sarcasm_headlines_file, # This must be the data file for the main table\n", |
164 | 166 | " target_variable=\"IsSarcasm\",\n", |
165 | | - " results_dir=sarcasm_results_dir,\n", |
166 | | - " additional_data_tables={\"Headline`HeadlineWords\": sarcasm_words_file},\n", |
| 167 | + " analysis_report_file_path=analysis_report_file_path_Sarcasm,\n", |
| 168 | + " additional_data_tables={\"HeadlineWords\": sarcasm_words_file},\n", |
167 | 169 | " max_constructed_variables=1000, # by default Khiops constructs 100 variables for AutoML multi-table\n", |
168 | 170 | " max_trees=0, # by default Khiops constructs 10 decision tree variables\n", |
169 | 171 | ")\n", |
|
192 | 194 | "cell_type": "markdown", |
193 | 195 | "metadata": {}, |
194 | 196 | "source": [ |
195 | | - "*Note: In the multi-table case, the input tables must be sorted by their key column in lexicographical order. To do this you may use the Khiops `sort_data_table` function or your favorite software. The examples of this tutorial have their tables pre-sorted.*" |
| 197 | + "*Note: In the multi-table case, the input tables must be sorted by their key column in lexicographical order. To do this, you may use the Khiops `sort_data_table` function. The examples of this tutorial have their tables pre-sorted.*" |
196 | 198 | ] |
197 | 199 | }, |
198 | 200 | { |
|
201 | 203 | "source": [ |
202 | 204 | "### Exercise time!\n", |
203 | 205 | "\n", |
204 | | - "Repeat the previous steps with the `AccidentsSummary` dataset. It describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:\n", |
| 206 | + "Repeat the previous steps with the `AccidentsSummary` dataset. This dataset describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:\n", |
205 | 207 | "```\n", |
206 | 208 | "+---------------+\n", |
207 | 209 | "|Accidents |\n", |
|
220 | 222 | " +---1:n--->|... |\n", |
221 | 223 | " +---------------+\n", |
222 | 224 | "```\n", |
223 | | - "So for each accident we have its characteristics (such as `Gravity` or `Light` conditions) and those of each involved vehicle (its `Direction` or `PassengerNumber`). The main task for this dataset is to predict the variable `Gravity` that has two possible values:`Lethal` and `NonLethal`.\n", |
| 225 | + "For each accident, we have its characteristics (such as `Gravity` or `Light` conditions) and those of each involved vehicle (its `Direction` or `PassengerNumber`). The main task for this dataset is to predict the variable `Gravity` that has two possible values:`Lethal` and `NonLethal`.\n", |
224 | 226 | "\n", |
225 | 227 | "We first save the paths of the `AccidentsSummary` dictionary file and data table files into variables:" |
226 | 228 | ] |
|
275 | 277 | "cell_type": "markdown", |
276 | 278 | "metadata": {}, |
277 | 279 | "source": [ |
278 | | - "We now save the results directory for this exercise:" |
| 280 | + "We now define the path of the modeling report for this exercise:" |
279 | 281 | ] |
280 | 282 | }, |
281 | 283 | { |
|
284 | 286 | "metadata": {}, |
285 | 287 | "outputs": [], |
286 | 288 | "source": [ |
287 | | - "accidents_results_dir = os.path.join(\"exercises\", \"AccidentSummary\")\n", |
288 | | - "print(f\"AccidentsSummary exercise results directory: {accidents_results_dir}\")" |
| 289 | + "analysis_report_file_path_Accidents = os.path.join(\n", |
| 290 | + " \"exercises\", \"AccidentSummary\", \"AnalysisReport.khj\"\n", |
| 291 | + ")" |
289 | 292 | ] |
290 | 293 | }, |
291 | 294 | { |
|
297 | 300 | "\n", |
298 | 301 | "Do not forget:\n", |
299 | 302 | "- The target variable is `Gravity`\n", |
300 | | - "- The key for the `additional_data_tables` parameter is ``Accident`Vehicles`` and its value that of `vehicles_data_file`\n", |
| 303 | + "- The key for the `additional_data_tables` parameter is ``Vehicles`` and its value that of `vehicles_data_file`\n", |
301 | 304 | "- Set `max_trees=0`" |
302 | 305 | ] |
303 | 306 | }, |
|
314 | 317 | " dictionary_name=\"Accident\",\n", |
315 | 318 | " data_table_path=accidents_data_file,\n", |
316 | 319 | " target_variable=\"Gravity\",\n", |
317 | | - " results_dir=accidents_results_dir,\n", |
318 | | - " additional_data_tables={\"Accident`Vehicles\": vehicles_data_file},\n", |
| 320 | + " analysis_report_file_path=analysis_report_file_path_Accidents,\n", |
| 321 | + " additional_data_tables={\"Vehicles\": vehicles_data_file},\n", |
319 | 322 | " max_constructed_variables=1000,\n", |
320 | 323 | " max_trees=0,\n", |
321 | 324 | ")\n", |
|