|
78 | 78 | "```\n", |
79 | 79 | "The `HeadlineId` variable is special because it is a _key_ that links a particular headline to its words (a 1:n relation).\n", |
80 | 80 | "\n", |
81 | | - "*Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only for pedagogical purporses.*\n", |
| 81 | + "*Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only used for pedagogical purporses.*\n", |
82 | 82 | "\n", |
83 | | - "To train a classifier with Khiops in this multi-table setup, this schema must be codified in the dictionary file. Let's check the contents of the `HeadlineSarcasm` dictionary file:" |
| 83 | + "To train a classifier with Khiops in this multi-table setup, this schema must be coded in a dictionary file. Let's check the contents of the `HeadlineSarcasm` dictionary file:" |
84 | 84 | ] |
85 | 85 | }, |
86 | 86 | { |
|
101 | 101 | "metadata": {}, |
102 | 102 | "source": [ |
103 | 103 | "As in the single-table case the `.kdic`file describes the schema for both tables, but note the following differences:\n", |
104 | | - "- The dictionary for the table `Headline` is prefixed by the `Root` keyword to indicate that is the main one.\n", |
105 | | - "- For both tables, their dictionary names are followed by `(HeadlineId)` to indicate that `HeadlineId` is the key of these tables.\n", |
106 | | - "- The schema for the main table contains an extra special variable defined with the statement `Table(Words) HeadlineWords`. This is, in addition to sharing the same key variable, is necessary to indicate the `1:n` relationship between the main and secondary table.\n", |
| 104 | + "- The dictionary for the table `Headline` is prefixed by the `Root` keyword. It is here optional and simply tags the main dictionary `Headline` representing the statistical instances.\n", |
| 105 | + "- For both tables, dictionary names are followed by `(HeadlineId)` to indicate that `HeadlineId` is their key.\n", |
| 106 | + "- The schema of the main table contains an extra special variable defined with the statement `Table(Words) HeadlineWords`. This is, in addition to sharing the same key variable, necessary to indicate the `1:n` relationship between the main and secondary table.\n", |
107 | 107 | "\n", |
108 | | - "Now let's store the location main and secondary tables and peek their contents:" |
| 108 | + "Now let's store the location of the main and secondary tables and peek their contents:" |
109 | 109 | ] |
110 | 110 | }, |
111 | 111 | { |
|
117 | 117 | "sarcasm_headlines_file = os.path.join(\"data\", \"HeadlineSarcasm\", \"Headlines.txt\")\n", |
118 | 118 | "sarcasm_words_file = os.path.join(\"data\", \"HeadlineSarcasm\", \"HeadlineWords.txt\")\n", |
119 | 119 | "\n", |
120 | | - "print(f\"HeadlineSarcasm main table file: {sarcasm_headlines_file}\")\n", |
| 120 | + "print(f\"HeadlineSarcasm main table file location: {sarcasm_headlines_file}\")\n", |
121 | 121 | "print(\"\")\n", |
122 | 122 | "peek(sarcasm_headlines_file, n=3)\n", |
123 | 123 | "\n", |
|
133 | 133 | "The call to the `train_predictor` will be very similar to the single-table case but there are some differences. \n", |
134 | 134 | "\n", |
135 | 135 | "The first is that we must pass the path of the extra secondary data table. This is done with the `additional_data_tables` parameter that is a Python dictionary containing key-value pairs for each table. More precisely:\n", |
136 | | - "- keys describe *data paths* of secondary tables. In this case only ``Headline`HeadlineWords``\n", |
137 | | - "- values describe the *file paths* of secondary tables. In this case only the file path we stored in `sarcasm_words_file`\n", |
| 136 | + "- keys describe *data paths* of secondary tables. In this case only, it is ``HeadlineWords``\n", |
| 137 | + "- values describe the *file paths* of secondary tables. In this case only, it is the file path we stored in `sarcasm_words_file`\n", |
138 | 138 | "\n", |
139 | | - "*Note: For understanding what data paths are see the \"Multi-Table Tasks\" section of the Khiops `core.api` documentation*\n", |
| 139 | + "*Note: To understand what data paths are, please check the \"Multi-Table Tasks\" section of the Khiops `core.api` documentation*\n", |
140 | 140 | "\n", |
141 | | - "Secondly, we specify how many features/aggregates Khiops will create with its multi-table AutoML mode. For the `HeadlineSarcasm` dataset Khiops can create features such as:\n", |
| 141 | + "Secondly, we must specify how many features/aggregates Khiops will create (at most) with its multi-table AutoML mode. For the `HeadlineSarcasm` dataset Khiops can create features such as:\n", |
142 | 142 | "- *Number of different words in the headline* \n", |
143 | 143 | "- *Most common word in the headline before the third one*\n", |
144 | 144 | "- *Number of times the word 'the' appears*\n", |
145 | 145 | "- ...\n", |
146 | 146 | "\n", |
147 | 147 | "It will then evaluate, select and combine the created features to build a classifier. We'll ask to create `1000` of these features (the default is `100`).\n", |
148 | 148 | "\n", |
149 | | - "With these considerations, let's setup the some extra variables and train the classifier:" |
| 149 | + "With these considerations, let's now train the classifier:" |
150 | 150 | ] |
151 | 151 | }, |
152 | 152 | { |
|
155 | 155 | "metadata": {}, |
156 | 156 | "outputs": [], |
157 | 157 | "source": [ |
158 | | - "sarcasm_results_dir = os.path.join(\"exercises\", \"HeadlineSarcasm\")\n", |
| 158 | + "analysis_report_file_path_Sarcasm = os.path.join(\n", |
| 159 | + " \"exercises\", \"HeadlineSarcasm\", \"AnalysisReport.khj\"\n", |
| 160 | + ")\n", |
159 | 161 | "\n", |
160 | 162 | "sarcasm_report, sarcasm_model_kdic = kh.train_predictor(\n", |
161 | 163 | " sarcasm_kdic,\n", |
162 | 164 | " dictionary_name=\"Headline\", # This must be the main/root dictionary\n", |
163 | 165 | " data_table_path=sarcasm_headlines_file, # This must be the data file for the main table\n", |
164 | 166 | " target_variable=\"IsSarcasm\",\n", |
165 | | - " results_dir=sarcasm_results_dir,\n", |
166 | | - " additional_data_tables={\"Headline`HeadlineWords\": sarcasm_words_file},\n", |
| 167 | + " analysis_report_file_path=analysis_report_file_path_Sarcasm,\n", |
| 168 | + " additional_data_tables={\"HeadlineWords\": sarcasm_words_file},\n", |
167 | 169 | " max_constructed_variables=1000, # by default Khiops constructs 100 variables for AutoML multi-table\n", |
168 | 170 | " max_trees=0, # by default Khiops constructs 10 decision tree variables\n", |
169 | 171 | ")\n", |
|
192 | 194 | "cell_type": "markdown", |
193 | 195 | "metadata": {}, |
194 | 196 | "source": [ |
195 | | - "*Note: In the multi-table case, the input tables must be sorted by their key column in lexicographical order. To do this you may use the Khiops `sort_data_table` function or your favorite software. The examples of this tutorial have their tables pre-sorted.*" |
| 197 | + "*Note: In the multi-table case, the input tables must be sorted by their key column in lexicographical order. To do this, you may use the Khiops `sort_data_table` function. The examples of this tutorial have their tables pre-sorted.*" |
196 | 198 | ] |
197 | 199 | }, |
198 | 200 | { |
|
201 | 203 | "source": [ |
202 | 204 | "### Exercise time!\n", |
203 | 205 | "\n", |
204 | | - "Repeat the previous steps with the `AccidentsSummary` dataset. It describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:\n", |
| 206 | + "Repeat the previous steps with the `AccidentsSummary` dataset. This dataset describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:\n", |
205 | 207 | "```\n", |
206 | 208 | "+---------------+\n", |
207 | 209 | "|Accidents |\n", |
|
220 | 222 | " +---1:n--->|... |\n", |
221 | 223 | " +---------------+\n", |
222 | 224 | "```\n", |
223 | | - "So for each accident we have its characteristics (such as `Gravity` or `Light` conditions) and those of each involved vehicle (its `Direction` or `PassengerNumber`). The main task for this dataset is to predict the variable `Gravity` that has two possible values:`Lethal` and `NonLethal`.\n", |
| 225 | + "For each accident, we have its characteristics (such as `Gravity` or `Light` conditions) and those of each involved vehicle (its `Direction` or `PassengerNumber`). The main task for this dataset is to predict the variable `Gravity` that has two possible values:`Lethal` and `NonLethal`.\n", |
224 | 226 | "\n", |
225 | 227 | "We first save the paths of the `AccidentsSummary` dictionary file and data table files into variables:" |
226 | 228 | ] |
|
275 | 277 | "cell_type": "markdown", |
276 | 278 | "metadata": {}, |
277 | 279 | "source": [ |
278 | | - "We now save the results directory for this exercise:" |
| 280 | + "We now define the path of the modeling report for this exercise:" |
279 | 281 | ] |
280 | 282 | }, |
281 | 283 | { |
|
284 | 286 | "metadata": {}, |
285 | 287 | "outputs": [], |
286 | 288 | "source": [ |
287 | | - "accidents_results_dir = os.path.join(\"exercises\", \"AccidentSummary\")\n", |
288 | | - "print(f\"AccidentsSummary exercise results directory: {accidents_results_dir}\")" |
| 289 | + "analysis_report_file_path_Accidents = os.path.join(\n", |
| 290 | + " \"exercises\", \"AccidentSummary\", \"AnalysisReport.khj\"\n", |
| 291 | + ")" |
289 | 292 | ] |
290 | 293 | }, |
291 | 294 | { |
|
297 | 300 | "\n", |
298 | 301 | "Do not forget:\n", |
299 | 302 | "- The target variable is `Gravity`\n", |
300 | | - "- The key for the `additional_data_tables` parameter is ``Accident`Vehicles`` and its value that of `vehicles_data_file`\n", |
| 303 | + "- The key for the `additional_data_tables` parameter is ``Vehicles`` and its value that of `vehicles_data_file`\n", |
301 | 304 | "- Set `max_trees=0`" |
302 | 305 | ] |
303 | 306 | }, |
|
314 | 317 | " dictionary_name=\"Accident\",\n", |
315 | 318 | " data_table_path=accidents_data_file,\n", |
316 | 319 | " target_variable=\"Gravity\",\n", |
317 | | - " results_dir=accidents_results_dir,\n", |
318 | | - " additional_data_tables={\"Accident`Vehicles\": vehicles_data_file},\n", |
| 320 | + " analysis_report_file_path=analysis_report_file_path_Accidents,\n", |
| 321 | + " additional_data_tables={\"Vehicles\": vehicles_data_file},\n", |
319 | 322 | " max_constructed_variables=1000,\n", |
320 | 323 | " max_trees=0,\n", |
321 | 324 | ")\n", |
|