{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "# Sentiment Analysis Based on Machine Learning\n", "\n", "\n", "![image.png](images/author.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "\n", "## Emotion\n", "Different types of emotion: anger, disgust, fear, joy, sadness, and surprise. The classification can be performed using different algorithms: e.g., naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti’s emotions lexicon.\n", "\n", "\n", "## Polarity\n", "\n", "To classify some text as positive or negative. In this case, the classification can be done by using a naive Bayes algorithm trained on Janyce Wiebe’s subjectivity lexicon." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![image.png](images/tweet.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Sentiment Analysis with Sklearn\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2020-10-30T12:17:46.275892Z", "start_time": "2020-10-30T12:17:45.503869Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2020-10-30T12:18:01.704223Z", "start_time": "2020-10-30T12:18:01.698696Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "pos_tweets = [('I love this car', 'positive'),\n", " ('This view is amazing', 'positive'),\n", " ('I feel great this morning', 'positive'),\n", " ('I am so excited about the concert', 'positive'),\n", " ('He is my best friend', 'positive')]\n", "\n", "neg_tweets = [('I do not like this car', 'negative'),\n", " ('This view is horrible', 'negative'),\n", " ('I feel tired this morning', 'negative'),\n", " ('I am not looking forward to the concert', 'negative'),\n", " ('He is my enemy', 'negative')]\n", "\n", "test_tweets = [\n", " ('feel happy this morning', 'positive'),\n", " ('larry is my friend', 'positive'),\n", " ('I do not like 
that man', 'negative'),\n", " ('house is not great', 'negative'),\n", " ('your song is annoying', 'negative')]\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2020-10-30T12:18:15.684528Z", "start_time": "2020-10-30T12:18:15.680710Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "dat = []\n", "for i in pos_tweets+neg_tweets+test_tweets:\n", " dat.append(i)\n", " \n", "X = np.array(dat).T[0]\n", "y = np.array(dat).T[1]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2020-10-30T12:18:18.112608Z", "start_time": "2020-10-30T12:18:18.060808Z" } }, "outputs": [], "source": [ "TfidfVectorizer?" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2020-10-30T12:19:10.532412Z", "start_time": "2020-10-30T12:19:10.519832Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "vec = TfidfVectorizer(stop_words='english', ngram_range = (1, 1), lowercase = True)\n", "X_vec = vec.fit_transform(X)\n", "Xtrain = X_vec[:10]\n", "Xtest = X_vec[10:]\n", "ytrain = y[:10]\n", "ytest= y[10:] " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2020-10-30T12:19:16.575565Z", "start_time": "2020-10-30T12:19:16.536320Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "\n", " | amazing | \n", "annoying | \n", "best | \n", "car | \n", "concert | \n", "enemy | \n", "excited | \n", "feel | \n", "forward | \n", "friend | \n", "... | \n", "house | \n", "larry | \n", "like | \n", "looking | \n", "love | \n", "man | \n", "morning | \n", "song | \n", "tired | \n", "view | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.655648 | \n", "0.000000 | \n", "0.0 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.755067 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.00000 | \n", "0.000000 | \n", "
1 | \n", "0.755067 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.0 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.00000 | \n", "0.655648 | \n", "
2 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.0 | \n", "0.000000 | \n", "0.554219 | \n", "0.000000 | \n", "0.000000 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.554219 | \n", "0.000000 | \n", "0.00000 | \n", "0.000000 | \n", "
3 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.655648 | \n", "0.0 | \n", "0.755067 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.00000 | \n", "0.000000 | \n", "
4 | \n", "0.000000 | \n", "0.000000 | \n", "0.755067 | \n", "0.000000 | \n", "0.000000 | \n", "0.0 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.655648 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.00000 | \n", "0.000000 | \n", "
5 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.707107 | \n", "0.000000 | \n", "0.0 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.707107 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.00000 | \n", "0.000000 | \n", "
6 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.0 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.00000 | \n", "0.655648 | \n", "
7 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.0 | \n", "0.000000 | \n", "0.522329 | \n", "0.000000 | \n", "0.000000 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.522329 | \n", "0.000000 | \n", "0.67405 | \n", "0.000000 | \n", "
8 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.523243 | \n", "0.0 | \n", "0.000000 | \n", "0.000000 | \n", "0.602585 | \n", "0.000000 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.602585 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.00000 | \n", "0.000000 | \n", "
9 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "1.0 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.00000 | \n", "0.000000 | \n", "
10 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.0 | \n", "0.000000 | \n", "0.522329 | \n", "0.000000 | \n", "0.000000 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.522329 | \n", "0.000000 | \n", "0.00000 | \n", "0.000000 | \n", "
11 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.0 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.655648 | \n", "... | \n", "0.000000 | \n", "0.755067 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.00000 | \n", "0.000000 | \n", "
12 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.0 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.655648 | \n", "0.000000 | \n", "0.000000 | \n", "0.755067 | \n", "0.000000 | \n", "0.000000 | \n", "0.00000 | \n", "0.000000 | \n", "
13 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.0 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "... | \n", "0.755067 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.00000 | \n", "0.000000 | \n", "
14 | \n", "0.000000 | \n", "0.707107 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.0 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.707107 | \n", "0.00000 | \n", "0.000000 | \n", "
15 rows × 23 columns
\n", "Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv" ], "text/plain": [ "Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 100 lines in 0.318532 secs." ], "text/plain": [ "Parsing completed. Parsed 100 lines in 0.318532 secs." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv" ], "text/plain": [ "Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 25000 lines in 0.499892 secs." ], "text/plain": [ "Parsing completed. Parsed 25000 lines in 0.499892 secs." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "movies_reviews_data = tc.SFrame.read_csv(traindata_path,header=True, \n", " delimiter='\\t',quote_char='\"', \n", " column_type_hints = {'id':str, \n", " 'sentiment' : str, \n", " 'review':str } )" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Using the SFrame show function, we can visualize the data and see that the training dataset consists of 12,500 positive and 12,500 negative reviews, of which 24,932 are unique." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:50:55.267343Z", "start_time": "2019-06-14T16:50:55.212701Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
id | \n", "sentiment | \n", "review | \n", "
---|---|---|
5814_8 | \n", "1 | \n", "With all this stuff going down at the moment with ... | \n",
"
2381_9 | \n", "1 | \n", ""The Classic War of the Worlds" by Timothy Hines ... | \n",
"
7759_3 | \n", "0 | \n", "The film starts with a manager (Nicholas Bell) ... | \n",
"
3630_4 | \n", "0 | \n", "It must be assumed that those who praised this ... | \n",
"
9495_8 | \n", "1 | \n", "Superbly trashy and wondrously unpretentious ... | \n",
"
8196_8 | \n", "1 | \n", "I dont know why people think this is such a bad ... | \n",
"
7166_2 | \n", "0 | \n", "This movie could have been very good, but c ... | \n",
"
10633_1 | \n", "0 | \n", "I watched this video at a friend's house. I'm glad ... | \n",
"
319_1 | \n", "0 | \n", "A friend of mine bought this film for £1, and ... | \n",
"
8713_10 | \n", "1 | \n", "<br /><br />This movie is full of references. Like ... | \n",
"
id | \n", "sentiment | \n", "review | \n", "1grams features | \n", "
---|---|---|---|
5814_8 | \n", "1 | \n", "With all this stuff going down at the moment with ... | \n",
" {'just': 3, 'sickest': 1, 'smooth': 1, 'this': 11, ... | \n",
"
2381_9 | \n", "1 | \n", ""The Classic War of the Worlds" by Timothy Hines ... | \n",
" {'year': 1, 'others': 1, 'those': 2, 'this': 1, ... | \n",
"
7759_3 | \n", "0 | \n", "The film starts with a manager (Nicholas Bell) ... | \n",
" {'hair': 1, 'bound': 1, 'this': 1, 'when': 2, ... | \n",
"
3630_4 | \n", "0 | \n", "It must be assumed that those who praised this ... | \n",
" {'crocuses': 1, 'that': 7, 'batonzilla': 1, ... | \n",
"
9495_8 | \n", "1 | \n", "Superbly trashy and wondrously unpretentious ... | \n",
" {'unshaven': 1, 'just': 1, 'in': 5, 'when': 2, ... | \n",
"
8196_8 | \n", "1 | \n", "I dont know why people think this is such a bad ... | \n",
" {'harry': 3, 'this': 4, 'of': 2, 'hurt': 1, ' ... | \n",
"
7166_2 | \n", "0 | \n", "This movie could have been very good, but c ... | \n",
" {'acting': 1, 'background': 1, 'just': ... | \n",
"
10633_1 | \n", "0 | \n", "I watched this video at a friend's house. I'm glad ... | \n",
" {'photography': 1, 'others': 1, 'zapruder': ... | \n",
"
319_1 | \n", "0 | \n", "A friend of mine bought this film for £1, and ... | \n",
" {'just': 1, 'this': 2, 'when': 1, 'as': 5, 's': ... | \n",
"
8713_10 | \n", "1 | \n", "<br /><br />This movie is full of references. Like ... | \n",
" {'peter': 1, 'ii': 1, 'full': 1, 'others': 1, ... | \n",
"
Logistic regression:" ], "text/plain": [ "Logistic regression:" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of examples : 19077" ], "text/plain": [ "Number of examples : 19077" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of classes : 2" ], "text/plain": [ "Number of classes : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of feature columns : 1" ], "text/plain": [ "Number of feature columns : 1" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of unpacked features : 68246" ], "text/plain": [ "Number of unpacked features : 68246" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of coefficients : 68247" ], "text/plain": [ "Number of coefficients : 68247" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Starting L-BFGS" ], "text/plain": [ "Starting L-BFGS" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ], "text/plain": [ "| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0 | 2 | 1.000000 | 1.111660 | 0.942182 | 0.860697 |" ], "text/plain": [ "| 0 | 2 | 1.000000 | 1.111660 | 0.942182 | 0.860697 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1 | 4 | 1.000000 | 1.253890 | 0.968444 | 0.865672 |" ], "text/plain": [ "| 1 | 4 | 1.000000 | 1.253890 | 0.968444 | 0.865672 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2 | 6 | 1.000000 | 1.390344 | 0.990040 | 0.897512 |" ], "text/plain": [ "| 2 | 6 | 1.000000 | 1.390344 | 0.990040 | 0.897512 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3 | 7 | 1.000000 | 1.474481 | 0.992923 | 0.899502 |" ], "text/plain": [ "| 3 | 7 | 1.000000 | 1.474481 | 0.992923 | 0.899502 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4 | 8 | 1.000000 | 1.563669 | 0.997379 | 0.891542 |" ], "text/plain": [ "| 4 | 8 | 1.000000 | 1.563669 | 0.997379 | 0.891542 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 9 | 13 | 1.000000 | 2.052863 | 1.000000 | 0.867662 |" ], "text/plain": [ "| 9 | 13 | 1.000000 | 2.052863 | 1.000000 | 0.867662 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
SVM:" ], "text/plain": [ "SVM:" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of examples : 19077" ], "text/plain": [ "Number of examples : 19077" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of classes : 2" ], "text/plain": [ "Number of classes : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of feature columns : 1" ], "text/plain": [ "Number of feature columns : 1" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of unpacked features : 68246" ], "text/plain": [ "Number of unpacked features : 68246" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of coefficients : 68247" ], "text/plain": [ "Number of coefficients : 68247" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Starting L-BFGS" ], "text/plain": [ "Starting L-BFGS" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ], "text/plain": [ "| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0 | 2 | 1.000000 | 0.125585 | 0.942182 | 0.860697 |" ], "text/plain": [ "| 0 | 2 | 1.000000 | 0.125585 | 0.942182 | 0.860697 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1 | 4 | 1.000000 | 0.268260 | 0.973738 | 0.875622 |" ], "text/plain": [ "| 1 | 4 | 1.000000 | 0.268260 | 0.973738 | 0.875622 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2 | 5 | 1.000000 | 0.348993 | 0.989411 | 0.881592 |" ], "text/plain": [ "| 2 | 5 | 1.000000 | 0.348993 | 0.989411 | 0.881592 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3 | 6 | 1.000000 | 0.433099 | 0.992976 | 0.884577 |" ], "text/plain": [ "| 3 | 6 | 1.000000 | 0.433099 | 0.992976 | 0.884577 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4 | 7 | 1.000000 | 0.519594 | 0.996016 | 0.881592 |" ], "text/plain": [ "| 4 | 7 | 1.000000 | 0.519594 | 0.996016 | 0.881592 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 9 | 12 | 1.000000 | 0.923521 | 0.999685 | 0.886567 |" ], "text/plain": [ "| 9 | 12 | 1.000000 | 0.923521 | 0.999685 | 0.886567 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Model selection based on validation accuracy:\n", "PROGRESS: ---------------------------------------------\n", "PROGRESS: LogisticClassifier : 0.8676616915422886\n", "PROGRESS: SVMClassifier : 0.8865671641791045\n", "PROGRESS: ---------------------------------------------\n", "PROGRESS: Selecting SVMClassifier based on validation set performance.\n" ] } ], "source": [ "model_1 = tc.classifier.create(train_set, target='sentiment', \\\n", " features=['1grams features'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can measure the performance of the classifier by evaluating it on the test dataset." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:51:55.289534Z", "start_time": "2019-06-14T16:51:54.504129Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "result1 = model_1.evaluate(test_set)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "To get an easy view of the classifier's prediction results, we define and use the following function:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:51:56.666544Z", "start_time": "2019-06-14T16:51:56.656636Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "******************************\n", "Accuracy : 0.8710858072387149\n", "Confusion Matrix: \n", " +--------------+-----------------+-------+\n", "| target_label | predicted_label | count |\n", 
"+--------------+-----------------+-------+\n", "| 0 | 1 | 374 |\n", "| 1 | 0 | 260 |\n", "| 1 | 1 | 2133 |\n", "| 0 | 0 | 2151 |\n", "+--------------+-----------------+-------+\n", "[4 rows x 3 columns]\n", "\n" ] } ], "source": [ "def print_statistics(result):\n", " print( \"*\" * 30)\n", " print( \"Accuracy : \", result[\"accuracy\"])\n", " print( \"Confusion Matrix: \\n\", result[\"confusion_matrix\"])\n", "print_statistics(result1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As the results above show, with just a few relatively straightforward lines of code we have built a sentiment classifier with an accuracy of about 0.87 on the test set. Next, we demonstrate how to improve the classifier's accuracy further." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Improving The Classifier" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One way to improve the movie review sentiment classifier is to extract more meaningful features from the reviews. One candidate set of additional features is the frequency of every pair of consecutive words (bigrams) in each review. To compute these frequencies we again use turicreate's count_ngrams function, this time with n set to 2 (n=2), and store the result in a new column named '2grams features'. 
" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:52:19.472533Z", "start_time": "2019-06-14T16:52:19.463443Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "movies_reviews_data['2grams features'] = tc.text_analytics.count_ngrams(movies_reviews_data['review'],2)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:52:20.971123Z", "start_time": "2019-06-14T16:52:20.734798Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
id | \n", "sentiment | \n", "review | \n", "1grams features | \n", "2grams features | \n", "
---|---|---|---|---|
5814_8 | \n", "1 | \n", "With all this stuff going down at the moment with ... | \n",
" {'just': 3, 'sickest': 1, 'smooth': 1, 'this': 11, ... | \n",
" {'alone a': 1, 'most people': 1, 'hope he' ... | \n",
"
2381_9 | \n", "1 | \n", ""The Classic War of the Worlds" by Timothy Hines ... | \n",
" {'year': 1, 'others': 1, 'those': 2, 'this': 1, ... | \n",
" {'slightest resemblance': 1, 'which is': 1, 'very ... | \n",
"
7759_3 | \n", "0 | \n", "The film starts with a manager (Nicholas Bell) ... | \n",
" {'hair': 1, 'bound': 1, 'this': 1, 'when': 2, ... | \n",
" {'quite boring': 1, 'packs a': 1, 'small ... | \n",
"
3630_4 | \n", "0 | \n", "It must be assumed that those who praised this ... | \n",
" {'crocuses': 1, 'that': 7, 'batonzilla': 1, ... | \n",
" {'but i': 1, 'is represented': 1, 'opera ... | \n",
"
9495_8 | \n", "1 | \n", "Superbly trashy and wondrously unpretentious ... | \n",
" {'unshaven': 1, 'just': 1, 'in': 5, 'when': 2, ... | \n",
" {'unpretentious 80': 1, 'sleazy black': 1, 'd ... | \n",
"
8196_8 | \n", "1 | \n", "I dont know why people think this is such a bad ... | \n",
" {'harry': 3, 'this': 4, 'of': 2, 'hurt': 1, ' ... | \n",
" {'like that': 1, 'see this': 1, 'is such': 1, ... | \n",
"
7166_2 | \n", "0 | \n", "This movie could have been very good, but c ... | \n",
" {'acting': 1, 'background': 1, 'just': ... | \n",
" {'linked to': 1, 'way short': 1, 'good but' ... | \n",
"
10633_1 | \n", "0 | \n", "I watched this video at a friend's house. I'm glad ... | \n",
" {'photography': 1, 'others': 1, 'zapruder': ... | \n",
" {'curiously ends': 1, 'several clips': 1, ... | \n",
"
319_1 | \n", "0 | \n", "A friend of mine bought this film for £1, and ... | \n",
" {'just': 1, 'this': 2, 'when': 1, 'as': 5, 's': ... | \n",
" {'bob thornton': 1, 'in the': 1, 'taking a': 1, ... | \n",
"
8713_10 | \n", "1 | \n", "<br /><br />This movie is full of references. Like ... | \n",
" {'peter': 1, 'ii': 1, 'full': 1, 'others': 1, ... | \n",
" {'in the': 1, 'is a': 1, 'lorre this': 1, 'much ... | \n",
"
Logistic regression:" ], "text/plain": [ "Logistic regression:" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of examples : 19077" ], "text/plain": [ "Number of examples : 19077" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of classes : 2" ], "text/plain": [ "Number of classes : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of feature columns : 2" ], "text/plain": [ "Number of feature columns : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of unpacked features : 1206694" ], "text/plain": [ "Number of unpacked features : 1206694" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of coefficients : 1206695" ], "text/plain": [ "Number of coefficients : 1206695" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Starting L-BFGS" ], "text/plain": [ "Starting L-BFGS" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ], "text/plain": [ "| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0 | 3 | 0.500000 | 0.884358 | 0.999266 | 0.866667 |" ], "text/plain": [ "| 0 | 3 | 0.500000 | 0.884358 | 0.999266 | 0.866667 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1 | 5 | 0.500000 | 1.542838 | 0.999948 | 0.866667 |" ], "text/plain": [ "| 1 | 5 | 0.500000 | 1.542838 | 0.999948 | 0.866667 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2 | 6 | 0.625000 | 1.909261 | 1.000000 | 0.865672 |" ], "text/plain": [ "| 2 | 6 | 0.625000 | 1.909261 | 1.000000 | 0.865672 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3 | 8 | 0.625000 | 2.436618 | 1.000000 | 0.864677 |" ], "text/plain": [ "| 3 | 8 | 0.625000 | 2.436618 | 1.000000 | 0.864677 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4 | 10 | 0.625000 | 2.971373 | 1.000000 | 0.863682 |" ], "text/plain": [ "| 4 | 10 | 0.625000 | 2.971373 | 1.000000 | 0.863682 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 9 | 18 | 0.976563 | 5.228981 | 1.000000 | 0.862687 |" ], "text/plain": [ "| 9 | 18 | 0.976563 | 5.228981 | 1.000000 | 0.862687 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
SVM:" ], "text/plain": [ "SVM:" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of examples : 19077" ], "text/plain": [ "Number of examples : 19077" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of classes : 2" ], "text/plain": [ "Number of classes : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of feature columns : 2" ], "text/plain": [ "Number of feature columns : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of unpacked features : 1206694" ], "text/plain": [ "Number of unpacked features : 1206694" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of coefficients : 1206695" ], "text/plain": [ "Number of coefficients : 1206695" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Starting L-BFGS" ], "text/plain": [ "Starting L-BFGS" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ], "text/plain": [ "| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0 | 2 | 1.000000 | 0.710178 | 0.999266 | 0.866667 |" ], "text/plain": [ "| 0 | 2 | 1.000000 | 0.710178 | 0.999266 | 0.866667 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1 | 4 | 1.000000 | 1.227603 | 1.000000 | 0.865672 |" ], "text/plain": [ "| 1 | 4 | 1.000000 | 1.227603 | 1.000000 | 0.865672 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2 | 5 | 1.000000 | 1.524246 | 1.000000 | 0.865672 |" ], "text/plain": [ "| 2 | 5 | 1.000000 | 1.524246 | 1.000000 | 0.865672 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3 | 6 | 1.000000 | 1.824261 | 1.000000 | 0.865672 |" ], "text/plain": [ "| 3 | 6 | 1.000000 | 1.824261 | 1.000000 | 0.865672 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4 | 13 | 0.001263 | 3.080125 | 1.000000 | 0.865672 |" ], "text/plain": [ "| 4 | 13 | 0.001263 | 3.080125 | 1.000000 | 0.865672 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 9 | 26 | 0.262737 | 6.006328 | 1.000000 | 0.865672 |" ], "text/plain": [ "| 9 | 26 | 0.262737 | 6.006328 | 1.000000 | 0.865672 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Model selection based on validation accuracy:\n", "PROGRESS: ---------------------------------------------\n", "PROGRESS: LogisticClassifier : 0.8626865671641791\n", "PROGRESS: SVMClassifier : 0.8656716417910447\n", "PROGRESS: ---------------------------------------------\n", "PROGRESS: Selecting SVMClassifier based on validation set performance.\n" ] } ], "source": [ "train_set, test_set = movies_reviews_data.random_split(0.8, seed=5)\n", "model_2 = tc.classifier.create(train_set, target='sentiment', features=['1grams features','2grams features'])\n", "result2 = model_2.evaluate(test_set)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:53:48.981670Z", "start_time": "2019-06-14T16:53:48.974028Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "******************************\n", "Accuracy : 0.8816592110614071\n", "Confusion Matrix: \n", " +--------------+-----------------+-------+\n", "| target_label | predicted_label | count |\n", "+--------------+-----------------+-------+\n", "| 0 | 1 | 343 |\n", "| 1 | 0 | 239 |\n", "| 1 | 1 | 2154 |\n", "| 0 | 0 | 2182 |\n", "+--------------+-----------------+-------+\n", "[4 rows x 3 columns]\n", "\n" ] } ], "source": [ "print_statistics(result2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Indeed, the newly constructed classifier appears to be more accurate, with a test accuracy of about 0.88." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Unlabeled Test File" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "To test how well the presented method works, we will use all 25,000 labeled IMDB movie reviews in the training dataset to construct a classifier. We will then use this classifier to predict the sentiment of each review in the unlabeled dataset. Lastly, we will create a submission file according to Kaggle's guidelines and submit it." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:55:06.281150Z", "start_time": "2019-06-14T16:54:02.968826Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv" ], "text/plain": [ "Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 100 lines in 0.282738 secs." ], "text/plain": [ "Parsing completed. Parsed 100 lines in 0.282738 secs." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv" ], "text/plain": [ "Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 25000 lines in 0.507212 secs." ], "text/plain": [ "Parsing completed. Parsed 25000 lines in 0.507212 secs." ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.\n", " You can set ``validation_set=None`` to disable validation tracking.\n", "\n", "PROGRESS: The following methods are available for this type of problem.\n", "PROGRESS: LogisticClassifier, SVMClassifier\n", "PROGRESS: The returned model will be chosen according to validation accuracy.\n" ] }, { "data": { "text/html": [ "
Logistic regression:" ], "text/plain": [ "Logistic regression:" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of examples : 23750" ], "text/plain": [ "Number of examples : 23750" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of classes : 2" ], "text/plain": [ "Number of classes : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of feature columns : 2" ], "text/plain": [ "Number of feature columns : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of unpacked features : 1407914" ], "text/plain": [ "Number of unpacked features : 1407914" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of coefficients : 1407915" ], "text/plain": [ "Number of coefficients : 1407915" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Starting L-BFGS" ], "text/plain": [ "Starting L-BFGS" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ], "text/plain": [ "| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0 | 2 | 1.000000 | 0.772874 | 0.998821 | 0.896000 |" ], "text/plain": [ "| 0 | 2 | 1.000000 | 0.772874 | 0.998821 | 0.896000 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1 | 4 | 1.000000 | 1.443709 | 0.999916 | 0.894400 |" ], "text/plain": [ "| 1 | 4 | 1.000000 | 1.443709 | 0.999916 | 0.894400 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2 | 6 | 0.648072 | 2.077022 | 0.999958 | 0.895200 |" ], "text/plain": [ "| 2 | 6 | 0.648072 | 2.077022 | 0.999958 | 0.895200 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3 | 8 | 0.648072 | 2.769055 | 0.999958 | 0.894400 |" ], "text/plain": [ "| 3 | 8 | 0.648072 | 2.769055 | 0.999958 | 0.894400 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4 | 10 | 0.648072 | 3.420360 | 0.999958 | 0.894400 |" ], "text/plain": [ "| 4 | 10 | 0.648072 | 3.420360 | 0.999958 | 0.894400 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 9 | 22 | 0.486054 | 6.816458 | 1.000000 | 0.892800 |" ], "text/plain": [ "| 9 | 22 | 0.486054 | 6.816458 | 1.000000 | 0.892800 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
SVM:" ], "text/plain": [ "SVM:" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of examples : 23750" ], "text/plain": [ "Number of examples : 23750" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of classes : 2" ], "text/plain": [ "Number of classes : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of feature columns : 2" ], "text/plain": [ "Number of feature columns : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of unpacked features : 1407914" ], "text/plain": [ "Number of unpacked features : 1407914" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of coefficients : 1407915" ], "text/plain": [ "Number of coefficients : 1407915" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Starting L-BFGS" ], "text/plain": [ "Starting L-BFGS" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ], "text/plain": [ "| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0 | 2 | 1.000000 | 0.724382 | 0.998821 | 0.896000 |" ], "text/plain": [ "| 0 | 2 | 1.000000 | 0.724382 | 0.998821 | 0.896000 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1 | 4 | 1.000000 | 1.284643 | 0.999916 | 0.896000 |" ], "text/plain": [ "| 1 | 4 | 1.000000 | 1.284643 | 0.999916 | 0.896000 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2 | 5 | 1.000000 | 1.634216 | 0.999958 | 0.896000 |" ], "text/plain": [ "| 2 | 5 | 1.000000 | 1.634216 | 0.999958 | 0.896000 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3 | 6 | 1.000000 | 2.002875 | 0.999958 | 0.896000 |" ], "text/plain": [ "| 3 | 6 | 1.000000 | 2.002875 | 0.999958 | 0.896000 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4 | 13 | 0.000338 | 3.462338 | 1.000000 | 0.895200 |" ], "text/plain": [ "| 4 | 13 | 0.000338 | 3.462338 | 1.000000 | 0.895200 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 9 | 38 | 4.080042 | 9.019969 | 1.000000 | 0.895200 |" ], "text/plain": [ "| 9 | 38 | 4.080042 | 9.019969 | 1.000000 | 0.895200 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Model selection based on validation accuracy:\n", "PROGRESS: ---------------------------------------------\n", "PROGRESS: LogisticClassifier : 0.8928\n", "PROGRESS: SVMClassifier : 0.8952\n", "PROGRESS: ---------------------------------------------\n", "PROGRESS: Selecting SVMClassifier based on validation set performance.\n" ] }, { "data": { "text/html": [ "
Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/testData.tsv" ], "text/plain": [ "Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/testData.tsv" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 100 lines in 0.313905 secs." ], "text/plain": [ "Parsing completed. Parsed 100 lines in 0.313905 secs." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/testData.tsv" ], "text/plain": [ "Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/testData.tsv" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 25000 lines in 0.560208 secs." ], "text/plain": [ "Parsing completed. Parsed 25000 lines in 0.560208 secs." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "traindata_path = \"/Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv\"\n", "testdata_path = \"/Users/datalab/bigdata/cjc/kaggle_popcorn_data/testData.tsv\"\n", "# Create a classifier using all 25,000 labeled reviews\n", "train_data = tc.SFrame.read_csv(traindata_path,header=True, delimiter='\\t',quote_char='\"', \n", " column_type_hints = {'id':str, 'sentiment' : int, 'review':str } )\n", "train_data['1grams features'] = tc.text_analytics.count_ngrams(train_data['review'],1)\n", "train_data['2grams features'] = tc.text_analytics.count_ngrams(train_data['review'],2)\n", "\n", "cls = tc.classifier.create(train_data, target='sentiment', features=['1grams features','2grams features'])\n", "# Load the unlabeled test dataset and build the same n-gram features\n", "test_data = tc.SFrame.read_csv(testdata_path,header=True, delimiter='\\t',quote_char='\"', \n", " column_type_hints = {'id':str, 'review':str } )\n", "test_data['1grams features'] = tc.text_analytics.count_ngrams(test_data['review'],1)\n", "test_data['2grams features'] = tc.text_analytics.count_ngrams(test_data['review'],2)\n", "\n", "# Predict the sentiment of each review in the test dataset\n", "test_data['sentiment'] = cls.classify(test_data)['class'].astype(int)\n", "\n", "# Save the predictions to a CSV file for submission\n", "test_data[['id','sentiment']].save(\"/Users/datalab/bigdata/cjc/kaggle_popcorn_data/predictions.csv\", format=\"csv\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We then submitted the predictions.csv file to the Kaggle challenge website and scored an AUC of about 0.88." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Further Reading" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Further reading material can be found at the following links:\n", "\n", "http://en.wikipedia.org/wiki/Bag-of-words_model\n", "\n", "https://dato.com/products/create/docs/generated/graphlab.SFrame.html\n", "\n", "https://dato.com/products/create/docs/graphlab.toolkits.classifier.html\n", "\n", "https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words\n", "\n", "Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). \"Learning Word Vectors for Sentiment Analysis.\" The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).\n" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "311.997px", "left": "719px", "top": "111px", "width": "416.267px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, 
"nbformat_minor": 1 }