The streaming job writes its output to HDFS in CSV format, so I need to load that data from HDFS into Hive; all of the subsequent processing is then done in Hive.

Detailed Steps

  1. Verify the Hive config (hive-site.xml in conf):

    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.cj.jdbc.Driver</value>
    </property>

    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>group11</value>
    </property>

    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>student</value>
    </property>
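
    In addition to the driver and credentials, the metastore connection also needs a javax.jdo.option.ConnectionURL entry pointing at the MySQL database; the host and database name below are placeholders for whatever the metastore actually uses:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
    </property>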
  2. Check the HDFS warehouse path:
    hduser@student59:~$ hdfs dfs -ls /user/hive/warehouse
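
    Each managed table in the default database is stored as a sub-directory of this warehouse path, so the files of a single table (assuming the test table shown below already exists) can be listed directly:
    hduser@student59:~$ hdfs dfs -ls /user/hive/warehouse/test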

  3. Check Hive Terminal:

    hive
    show databases;
    show tables;

Results:

hive> show databases;
OK
default
Time taken: 2.95 seconds, Fetched: 1 row(s)
hive> show tables;
OK
test

This shows that there is one database called default, which contains a table called test.

  4. Create the table:
    CREATE TABLE b_results (b_price float, s_output int, t_time String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    Check the table structure:
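    One way to do this from the Hive CLI (DESCRIBE FORMATTED additionally shows the storage location on HDFS):

    DESCRIBE b_results;
    DESCRIBE FORMATTED b_results;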
  5. Load data from the local file system (the file is called test.csv):
    path: /opt/apache-hive-2.3.2-bin/test.csv
    LOAD DATA LOCAL INPATH '/opt/apache-hive-2.3.2-bin/test.csv' OVERWRITE INTO TABLE test;
    Loading data to table default.test

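    A quick sanity check that the rows actually landed in the table (a sketch; the output depends on what is in test.csv):

    SELECT * FROM test LIMIT 10;
    SELECT COUNT(*) FROM test;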

Read the HDFS file and save it into Hive

LOAD DATA INPATH '/twitter_sentiment_bitcoin/student59.txt' OVERWRITE INTO TABLE b_results;
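Note that LOAD DATA INPATH (without LOCAL) moves the file from its original HDFS location into the table's warehouse directory rather than copying it, so the source path is empty afterwards. A quick check that the load worked (a sketch, assuming the CSV columns line up with the b_results definition):

SELECT COUNT(*) FROM b_results;
SELECT * FROM b_results LIMIT 10;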

Using Tableau to connect to Hive (visualization)


In practice, connecting to the Hive table from Tableau is quite slow (I think it is because the data set is too large).
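
Tableau talks to Hive through HiveServer2 (Thrift/JDBC, port 10000 by default), so that service must be running on the Hive host. A minimal sketch for starting it and checking connectivity, assuming default settings and the hduser account from above (adjust the user and host as needed):

# start HiveServer2 in the background (default Thrift port 10000)
$HIVE_HOME/bin/hiveserver2 &

# quick connectivity check from the command line with Beeline
$HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -n hduser -e "SELECT COUNT(*) FROM b_results;"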