For the given unknown dataset, I was able to do very well because I spent most of my time on dummy-variable and feature creation during the EDA process. I filled in the gaps in terms of uncertainty and followed common multi-class prediction workflow practices. With optimized random forest parameters, I achieved an F1 score of 0.95, a precision of 0.96, and a recall of 0.96.
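As a rough sketch of what that tuning step might look like (the dataset, parameter grid, and split below are illustrative assumptions, not the notebook's actual values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data; the real notebook uses the features engineered during EDA.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Hypothetical parameter grid; the actual tuned values may differ.
param_grid = {"n_estimators": [100, 300],
              "max_depth": [None, 10, 20],
              "min_samples_leaf": [1, 2, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      scoring="f1_macro", cv=5)
search.fit(X_train, y_train)

# Evaluate the best estimator: the report shows per-class precision,
# recall, and F1.
rfc = search.best_estimator_
print(classification_report(y_test, rfc.predict(X_test)))
```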
Run the whole notebook and use the final model's "rfc.predict" to score an unlabeled dataset. Import the unlabeled dataset at the beginning, apply all of the feature extractions in the code just as for the original df, and then, once the model is trained, apply that same model to the unlabeled dataset to get a new set of results.
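A minimal sketch of that scoring flow, assuming pandas get_dummies features and hypothetical file names ("train.csv", "unlabeled.csv"); the reindex call keeps the unlabeled frame's columns aligned with the training layout:

```python
import pandas as pd

# Hypothetical file names; the notebook imports its own data at the top.
df = pd.read_csv("train.csv")
unlabeled = pd.read_csv("unlabeled.csv")

# Apply the same feature extraction to both frames.
X_train = pd.get_dummies(df.drop(columns=["label"]))
X_new = pd.get_dummies(unlabeled)

# Align the unlabeled columns with the training columns so rfc.predict sees
# the exact same feature layout; dummies absent from the new data become 0.
X_new = X_new.reindex(columns=X_train.columns, fill_value=0)

unlabeled["predicted_label"] = rfc.predict(X_new)  # rfc: the trained model
```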
My machine learning deliverables include price prediction for Sears and some NLP work I did on Reddit for spam-bot detection.
8/10, very familiar
I have been using numpy and pandas for 4 years.
scikit-learn for 3 years; Keras and TensorFlow for 2.
I used NLTK for the challenge notebook. I am familiar with naive Bayes on bag-of-words features and with Stanford NLP.
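A minimal sketch of the naive Bayes bag-of-words approach, here using scikit-learn's CountVectorizer and MultinomialNB with made-up spam/ham documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus -- purely illustrative spam/ham examples.
docs = ["free money click now", "meeting at noon tomorrow",
        "win a free prize today", "lunch plans this week"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words counts feed a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["free prize money"]))  # -> ['spam']
```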
Precision is true positives over the sum of true positives and false positives (TP / (TP + FP)). Recall is true positives over the sum of true positives and false negatives (TP / (TP + FN)). The F1 score is the harmonic mean of precision and recall.
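A quick worked check of those definitions against scikit-learn (the labels and counts below are made up):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# For the positive class here: TP = 3, FP = 1, FN = 1.
tp, fp, fn = 3, 1, 1
precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # 0.75
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean = 0.75

# sklearn agrees with the hand computation.
assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
assert f1 == f1_score(y_true, y_pred)
```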
ROC curve and AUC, accuracy, F1 score, precision, and recall, depending on consumer needs or the problem statement.
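A minimal sketch of computing a ROC curve and its AUC with scikit-learn (the classifier and data are toy assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# True positive rate vs. false positive rate across thresholds.
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", auc(fpr, tpr))
```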
Classifiers: decision trees, random forest, and logistic regression.
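As a sketch of how I might compare those three classifier families (the dataset and scoring choice are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Compare the three models with 5-fold cross-validation on macro F1.
for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(random_state=0),
              LogisticRegression(max_iter=1000)):
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    print(type(model).__name__, scores.mean().round(3))
```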
I've done prediction work throughout my GitHub: https://github.com/yaowser/qtw
The advantage is that everything can be optimized and that the network can capture nuance in the data through its neurons, dropout rate, and layers. The disadvantage is that it is a black box, where the important features remain abstract.
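A minimal Keras sketch of those tunable pieces (the layer sizes, dropout rate, and 10-class output are illustrative choices, not a specific project's architecture):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(20,)),                 # assumed 20 input features
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),                      # dropout regularizes the layer
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),   # multi-class output
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```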
KNN, multi-class classification, geospatial clustering, and sound separation.
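As a small sketch of KNN on a multi-class problem (the wine dataset stands in for real data; scaling is included because KNN is distance-based):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Wine is a convenient stand-in multi-class dataset (3 classes).
X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)
print("test accuracy:", knn.score(X_te, y_te))
```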
Background on the features, EDA, imputation, dummy variables, nuance and auxiliary variables, log transformations, data scaling, model selection and comparison, feature optimization, metrics, and the final model and results. Several of the preprocessing steps chain together as in the sketch below.
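A minimal scikit-learn sketch of how imputation, log transformation, scaling, and dummy variables might combine into one pipeline (the column names and the choice of model are hypothetical):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical column split; a real notebook would inspect the dataframe.
numeric_cols = ["price", "quantity"]
categorical_cols = ["region"]

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),   # log transformation
    ("scale", StandardScaler()),              # data scaling
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("dummies", OneHotEncoder(handle_unknown="ignore")),  # dummy variables
])

preprocess = ColumnTransformer([
    ("num", numeric, numeric_cols),
    ("cat", categorical, categorical_cols),
])
model = Pipeline([("prep", preprocess),
                  ("rfc", RandomForestClassifier(random_state=42))])
```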
I taught a 45-minute Apache Spark on Databricks lecture to my class on YouTube, and I am very familiar with it. I have also used Bigtable, BigQuery, some Hadoop, and AWS, and I hold an Azure certification. Video lecture (teach Spark final project): https://youtu.be/IVMbSDS4q3A
Can you give examples of scenarios where machine learning is not the best option, and where do you think we should apply machine learning?
Machine learning is not the best option when the data is poor, because a model is only as good as its dataset. Therefore, data collection and an unbiased sampling procedure come first, before machine learning. After that, we need to read into the dataset to make sense of it before proceeding to model selection and the rest of the workflow. We can apply machine learning to prediction and classification problems built on proper data.