Pig

Pig online documentation

Pig Latin, a high-level data-flow language
An execution environment that runs Pig Latin programs

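Pig lets you concentrate on the data and the business logic itself, instead of wrestling with data-format conversion and hand-written MapReduce programs.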

Execution Modes

Local

Local mode only requires a single machine. Pig runs on the local host and accesses the local filesystem: pig -x local ...

MapReduce

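MapReduce mode is the default and runs Pig against a Hadoop cluster: pig ... (equivalent to pig -x mapreduce ...)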

Interactive Mode

Pig can be run interactively in the Grunt shell.

pig -x local
...
grunt>

Batch Mode

Run a Pig script from the command line: pig -x local id.pig

Pig syntax

Each statement is an operator that takes a relation as input, performs a transformation on that relation, and produces a relation as output. Statements can span multiple lines and must end with a semicolon (;).

A LOAD statement that reads the data from the filesystem
One or more statements to transform the data
A DUMP or STORE statement to view or store the results

load

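USING is optional; the default load function is PigStorage, which expects tab-delimited (\t) fields. AS is optional too; without a schema, fields are unnamed and default to type bytearray.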

LOAD 'data' [USING function] [AS schema];

A = LOAD 'students' AS (name:chararray, age:int, gpa:float);
DUMP A; 

(john,21,3.89) 
(sally,19,2.56) 
(alice,22,3.76) 
(doug,19,1.98) 
(susan,26,3.25)
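
As a rough analogy only (this is not how Pig is implemented), the effect of LOAD with the default tab delimiter and a three-field schema can be sketched in Python. The `load` helper and the sample lines are hypothetical and exist only for illustration.

```python
# Rough Python analogy of: A = LOAD 'students' AS (name:chararray, age:int, gpa:float);
# The helper name and the sample data are assumptions, not part of Pig.

def load(lines, schema, delimiter="\t"):
    """Split each line on the delimiter and cast each field using the schema."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        rows.append(tuple(cast(value) for (_, cast), value in zip(schema, fields)))
    return rows

# Hypothetical tab-delimited contents of the 'students' file.
raw = ["john\t21\t3.89", "sally\t19\t2.56", "alice\t22\t3.76"]
schema = [("name", str), ("age", int), ("gpa", float)]

students = load(raw, schema)
print(students[0])
```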

Transforming Data

Conditions can be combined with and, or, and not.

FILTER

Works on rows (tuples): keeps only the rows that satisfy a condition.

A = LOAD 'students' AS (name:chararray, age:int, gpa:float);

DUMP A; 
(john,21,3.89)
(sally,19,2.56) 
(alice,22,3.76)
(doug,19,1.98)
(susan,26,3.25)

R = FILTER A BY age>=20;
DUMP R; 

(john,21,3.89) 
(alice,22,3.76) 
(susan,26,3.25)
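
A minimal Python sketch of the same filtering, assuming the relation is held as a list of tuples. The second expression shows how and/not would combine; the Pig statement quoted in its comment is an illustrative assumption, not taken from the example above.

```python
# The relation A as a list of (name, age, gpa) tuples.
students = [
    ("john", 21, 3.89),
    ("sally", 19, 2.56),
    ("alice", 22, 3.76),
    ("doug", 19, 1.98),
    ("susan", 26, 3.25),
]

# Analogy of: R = FILTER A BY age >= 20;
filtered = [row for row in students if row[1] >= 20]

# Analogy of a combined condition such as: FILTER A BY (age >= 20) AND NOT (gpa < 3.5);
combined = [row for row in students if row[1] >= 20 and not row[2] < 3.5]

print(filtered)
print(combined)
```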

FOREACH

Works on columns: generates output columns for every row, somewhat like SELECT in SQL.

R = FOREACH A GENERATE *;
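
In Python terms, FOREACH ... GENERATE behaves roughly like a list comprehension over the rows. The sketch below assumes a list of tuples; the projection onto name and age is an illustrative extra, not part of the statement above.

```python
# The relation A as a list of (name, age, gpa) tuples.
students = [
    ("john", 21, 3.89),
    ("sally", 19, 2.56),
    ("alice", 22, 3.76),
]

# Analogy of: R = FOREACH A GENERATE *;  (every column is kept)
all_columns = [row for row in students]

# Analogy of: R = FOREACH A GENERATE name, age;  (projection onto two columns)
name_age = [(name, age) for name, age, gpa in students]

print(name_age)
```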

GROUP

B = GROUP A BY age;
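
GROUP produces one tuple per distinct key, holding the key and a bag of all matching rows. A small Python sketch of that shape, assuming the same student tuples as before:

```python
from collections import defaultdict

# The relation A as a list of (name, age, gpa) tuples.
students = [
    ("john", 21, 3.89),
    ("sally", 19, 2.56),
    ("alice", 22, 3.76),
    ("doug", 19, 1.98),
    ("susan", 26, 3.25),
]

# Analogy of: B = GROUP A BY age;
bags = defaultdict(list)
for row in students:
    bags[row[1]].append(row)  # row[1] is the age column

# Each result pairs the group key with the bag of rows that share it.
grouped = sorted(bags.items())

for age, bag in grouped:
    print(age, bag)
```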

STORE

STORE alias INTO 'directory' [USING function];

A = LOAD 'students' AS (name:chararray, age:int, gpa:float);
STORE A INTO 'output' USING PigStorage('|');
CAT output;
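
A rough Python sketch of what storing with a '|' delimiter produces: a directory containing a file with one delimited line per row. The `store` helper and the part file name are hypothetical; Pig chooses its own part file names.

```python
import os
import tempfile

def store(rows, directory, delimiter="\t"):
    """Write one delimited line per row into a part file inside the directory."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "part-00000")  # hypothetical file name
    with open(path, "w") as handle:
        for row in rows:
            handle.write(delimiter.join(str(field) for field in row) + "\n")
    return path

students = [("john", 21, 3.89), ("sally", 19, 2.56)]

# Use a temporary location so the sketch does not touch the working directory.
out_dir = os.path.join(tempfile.mkdtemp(), "output")
part = store(students, out_dir, delimiter="|")

with open(part) as handle:
    content = handle.read()
print(content)
```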

UDF

pig_util

from pig_util import outputSchema


@outputSchema('word:chararray')
def reverse(word):
    """Return the reverse text of the provided word."""
    return word[::-1]


@outputSchema('length:int')
def num_chars(word):
    """Return the length of the provided word."""
    return len(word)

REGISTER 'my_udf.py' USING streaming_python AS string_udf;
term_length = FOREACH unique_terms GENERATE word, string_udf.num_chars(word) as length;
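
The UDF bodies are plain Python, so they can be checked locally before registering them with Pig. The sketch below replaces the outputSchema decorator with a hypothetical stand-in that only records the schema string, and uses a made-up unique_terms list.

```python
def outputSchema(schema):
    """Local stand-in for Pig's decorator (assumption): it only records the schema."""
    def wrap(func):
        func.output_schema = schema
        return func
    return wrap

@outputSchema('word:chararray')
def reverse(word):
    """Return the reverse text of the provided word."""
    return word[::-1]

@outputSchema('length:int')
def num_chars(word):
    """Return the length of the provided word."""
    return len(word)

# Made-up sample standing in for the unique_terms relation.
unique_terms = ["pig", "hadoop"]

# Analogy of: FOREACH unique_terms GENERATE word, string_udf.num_chars(word) as length;
term_length = [(word, num_chars(word)) for word in unique_terms]
print(term_length)
```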

Hive

HBase

Data model concepts

Table

Row

Column family

Column qualifier

Cell

Timestamp
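
One way to picture how these concepts nest is a map of maps: row key, then column family, then column qualifier, then timestamp, with the cell value at the bottom. The Python sketch below is only a mental model with hypothetical helper names and sample data; it is not the HBase client API.

```python
# table[row][family][qualifier][timestamp] -> cell value
table = {}

def put(table, row, family, qualifier, value, timestamp):
    """Store one version of a cell under its full coordinates."""
    versions = table.setdefault(row, {}).setdefault(family, {}).setdefault(qualifier, {})
    versions[timestamp] = value

def get(table, row, family, qualifier, timestamp=None):
    """Return the requested version, or the newest one when no timestamp is given."""
    versions = table[row][family][qualifier]
    if timestamp is None:
        timestamp = max(versions)
    return versions[timestamp]

# Hypothetical sample: two versions of the same cell.
put(table, "row1", "info", "name", "john", timestamp=1)
put(table, "row1", "info", "name", "johnny", timestamp=2)

latest = get(table, "row1", "info", "name")
older = get(table, "row1", "info", "name", timestamp=1)
print(latest, older)
```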

Column-family storage

Reduces I/O
Supports highly concurrent queries
High data compression ratio

Architecture

The access path and overall structure are somewhat similar to Hadoop's.

ZooKeeper

Master

Region

Tables and Regions

Locating a Region

Structure

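The MemStore has limited capacity, so its contents are periodically flushed to a StoreFile, and a marker is written to the HLog. Every flush produces a new StoreFile; when the number of StoreFiles reaches a threshold, they are merged into one large StoreFile. When the large StoreFile's size reaches a threshold, a split is triggered.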
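
The flush, merge, and split sequence can be sketched as a toy simulation. Everything below is an assumption made for illustration: the class, the tiny thresholds, and the size-triggered flush (a simplification of the periodic flush described above) are not HBase's actual implementation or defaults.

```python
class Store:
    """Toy model of a MemStore with StoreFiles; thresholds are made-up numbers."""

    def __init__(self, memstore_limit=3, compaction_threshold=3, split_size=8):
        self.memstore_limit = memstore_limit
        self.compaction_threshold = compaction_threshold
        self.split_size = split_size
        self.memstore = {}
        self.store_files = []  # each StoreFile is a sorted list of (key, value)
        self.hlog = []
        self.split_requested = False

    def put(self, key, value):
        self.hlog.append(("put", key, value))  # log the write first
        self.memstore[key] = value
        if len(self.memstore) >= self.memstore_limit:
            self.flush()

    def flush(self):
        # Each flush writes the MemStore out as a new StoreFile and marks the HLog.
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.hlog.append(("flush-marker",))
        if len(self.store_files) >= self.compaction_threshold:
            self.compact()

    def compact(self):
        # Merge all StoreFiles into one; newer files overwrite older values.
        merged = {}
        for store_file in self.store_files:
            merged.update(store_file)
        big_file = sorted(merged.items())
        self.store_files = [big_file]
        if len(big_file) >= self.split_size:
            self.split_requested = True

store = Store()
for i in range(9):
    store.put("key%d" % i, i)

print(len(store.store_files), store.split_requested)
```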
