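- System environment: Win10 (64-bit) <-- it must be 64-bit
- Linux is not covered for now, because deployment there is simpler than on Windows
- Java version: Java 1.8.0_144
- Hadoop version: Apache Hadoop 2.7.4
- (The rest of this tutorial uses 2.7.4 as the example)
- Python version: Python 3.4.4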
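- Hadoop download locations:
- Prefer a mirror inside China; if you need an older release, you can only get it from the official source.
- 1. Aliyun mirror:
- 2. Tsinghua mirror:
- 3. Official:
- Winutils (Linux users can skip this):
- 1. The winutils version must match the Hadoop version
- 2. If you do not download it, you will have to find a winutils package that someone else compiled, which is more likely to cause bugs.
- Download Java and Python yourself (search online)
- JDK 1.8
- Python 3.4.4
- Set up the Python environment yourself; it is not covered here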
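- Step 0 (Java environment variables)
- On Windows, the Java path must never contain spaces.
- After installing Java, copy the installation directory to a path that contains no spaces at all.
- Configure the Java environment variables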
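The "no spaces" rule is easy to check mechanically. The following is a minimal sketch, assuming the variable is called `JAVA_HOME`; the function name `java_home_problems` and the sample paths in the usage note are hypothetical.

```python
import os


def java_home_problems(java_home):
    """Return a list of human-readable problems with a JAVA_HOME value."""
    problems = []
    if not java_home:
        problems.append("JAVA_HOME is not set")
        return problems
    if " " in java_home:
        problems.append("JAVA_HOME contains a space: %r" % java_home)
    return problems


# Check the value from the current environment and report anything suspicious.
for problem in java_home_problems(os.environ.get("JAVA_HOME", "")):
    print(problem)
```

For instance, a hypothetical `C:\Program Files\Java\jdk1.8.0_144` would be flagged, while `D:\bigdata\jdk1.8.0_144` would pass.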
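- Step 1 (Hadoop environment setup)
- Note: the deployment below is a single-machine, single-node deployment
- 1.1. Extract the archive into a directory of your choice
- I created a path on drive D: D:/bigdata/
- After extraction, the Hadoop path is: D:/bigdata/hadoop-2.7.4
- 1.2. Go to etc\hadoop\ under the Hadoop root directory
- 1.3. Edit core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml (mapred-site.xml originally has a .template suffix; rename the file to drop the suffix)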
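- core-site.xml:

      <configuration>
        <property>
          <name>fs.defaultFS</name>
          <value>hdfs://localhost:9000</value>
          <description>HDFS URI: filesystem://namenode-identifier:port</description>
        </property>
        <property>
          <name>hadoop.tmp.dir</name>
          <value>/D:/bigdata/hadoop-2.7.4/workplace/tmp</value>
          <description>Local Hadoop temporary directory on the namenode</description>
        </property>
      </configuration>

- hdfs-site.xml:

      <configuration>
        <property>
          <name>dfs.replication</name>
          <value>1</value>
          <description>Number of replicas; the default is 3, and it should not exceed the number of datanode machines</description>
        </property>
        <property>
          <name>dfs.name.dir</name>
          <value>/D:/bigdata/hadoop-2.7.4/workplace/name</value>
          <description>Where the namenode stores the HDFS namespace metadata</description>
        </property>
        <property>
          <name>dfs.data.dir</name>
          <value>/D:/bigdata/hadoop-2.7.4/workplace/data</value>
          <description>Physical storage location of data blocks on the datanode</description>
        </property>
        <property>
          <name>dfs.webhdfs.enabled</name>
          <value>true</value>
          <description>WebHDFS interface</description>
        </property>
        <property>
          <name>dfs.permissions</name>
          <value>false</value>
          <description>HDFS permission checking</description>
        </property>
      </configuration>

- yarn-site.xml:

      <configuration>
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
        <property>
          <name>yarn.nodemanager.resource.memory-mb</name>
          <value>8192</value>
        </property>
        <property>
          <name>yarn.scheduler.minimum-allocation-mb</name>
          <value>1536</value>
        </property>
        <property>
          <name>yarn.scheduler.maximum-allocation-mb</name>
          <value>4096</value>
        </property>
        <property>
          <name>yarn.nodemanager.resource.cpu-vcores</name>
          <value>2</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
      </configuration>

- mapred-site.xml:

      <configuration>
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>
        <property>
          <name>mapreduce.map.memory.mb</name>
          <value>2048</value>
        </property>
        <property>
          <name>mapreduce.reduce.memory.mb</name>
          <value>2048</value>
        </property>
        <property>
          <name>mapreduce.jobtracker.http.address</name>
          <value>localhost:50030</value>
        </property>
        <property>
          <name>mapreduce.jobhistory.address</name>
          <value>localhost:10020</value>
        </property>
        <property>
          <name>mapreduce.jobhistory.webapp.address</name>
          <value>localhost:19888</value>
        </property>
        <property>
          <name>mapred.job.tracker</name>
          <value>http://localhost:9001</value>
        </property>
      </configuration>

- 1.4. Go back to the Hadoop root directory and create the folders specified in hdfs-site.xml and core-site.xml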
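Step 1.4 can also be scripted rather than done by hand. A rough sketch, assuming the three folders are `workplace/tmp`, `workplace/name` and `workplace/data` under the Hadoop root, as in the configuration values above; the function name `create_workplace_dirs` is made up for illustration, and the demo runs against a throwaway directory instead of D:/bigdata/hadoop-2.7.4.

```python
import os
import tempfile

# Sub-folders named in the configuration values (workplace/tmp, name, data).
WORKPLACE_SUBDIRS = ("tmp", "name", "data")


def create_workplace_dirs(hadoop_root):
    """Create <hadoop_root>/workplace/{tmp,name,data} and return the paths."""
    created = []
    for sub in WORKPLACE_SUBDIRS:
        path = os.path.join(hadoop_root, "workplace", sub)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


# Demo against a temporary directory; on the tutorial machine the root
# would be D:/bigdata/hadoop-2.7.4.
demo_root = tempfile.mkdtemp()
for path in create_workplace_dirs(demo_root):
    print(path)
```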
- 1.5. Configure environment variables:
- 1.5.1. Add the system variable HADOOP_CONF_DIR
- 1.5.2. Add the system variable HADOOP_HOME
- 1.5.3. Add to Path:
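The values of these variables are not shown here, so the check below is only a sketch built on assumptions: that `HADOOP_HOME` points at the Hadoop root and that Path should contain its `bin` folder. The helper name `check_hadoop_env` is hypothetical.

```python
import os


def check_hadoop_env(env):
    """Return a list of problems found in a mapping of environment variables."""
    problems = []
    for name in ("HADOOP_HOME", "HADOOP_CONF_DIR"):
        if not env.get(name):
            problems.append("%s is not set" % name)
    home = env.get("HADOOP_HOME", "")
    if home:
        # Assumption: Path should contain the bin folder under HADOOP_HOME.
        expected = os.path.normcase(os.path.normpath(os.path.join(home, "bin")))
        entries = [os.path.normcase(os.path.normpath(p))
                   for p in env.get("PATH", "").split(os.pathsep) if p]
        if expected not in entries:
            problems.append("PATH does not contain %s" % expected)
    return problems


# Check the current process environment and print whatever looks wrong.
for problem in check_hadoop_env(dict(os.environ)):
    print(problem)
```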
- 1.6. Initialize (format) the namenode

      hadoop namenode -format
- 1.7. (Once step 1.6 finishes without errors) start Hadoop

      cd /d D:\bigdata\hadoop-2.7.4\sbin
      dir
      rem Recommended startup sequence
      start-dfs.cmd
      start-yarn.cmd
      rem Blunt alternative: start everything at once
      start-all.cmd
- 1.8. (Once the steps above finish without errors) open a new cmd window

      rem Check that each component started successfully and has a process ID
      jps
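Reading the `jps` listing by eye works, but the check can be sketched in a few lines. This assumes the daemons expected on a single node are NameNode, DataNode, ResourceManager and NodeManager (the list is not given in the text), and the sample output, including its process IDs, is invented for illustration.

```python
EXPECTED_DAEMONS = ("NameNode", "DataNode", "ResourceManager", "NodeManager")


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # each line looks like "<pid> <main class name>"
            running.add(parts[1])
    return [name for name in expected if name not in running]


# Invented sample: NodeManager is deliberately absent.
sample = """\
4112 NameNode
5380 DataNode
7924 ResourceManager
9016 Jps
"""
print(missing_daemons(sample))  # prints ['NodeManager']
```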
- 1.9. Open the test links (move on to the next step once they load without errors):
- 2.0. Test whether the Hadoop MapReduce examples run:

      rem List the available example programs
      hadoop jar D:\bigdata\hadoop-2.7.4\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.4.jar -info
      rem Run the most commonly used test, pi (3 map tasks, 100 samples per task; their product is the total sample count)
      hadoop jar D:\bigdata\hadoop-2.7.4\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.4.jar pi 3 100
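To make the two arguments concrete, here is a small local sketch of the same idea: 3 "tasks" of 100 samples each give 3 × 100 = 300 samples in total. Note that this uses plain pseudo-random sampling, not the quasi-Monte Carlo method of the Hadoop example, and the function names are illustrative only.

```python
import random


def count_inside(n_samples, rng):
    """One 'map task': count random points that fall inside the unit quarter circle."""
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside


def estimate_pi(n_maps, n_samples, seed=0):
    """'Reduce' step: add up the per-task counts and scale to an estimate of pi."""
    rng = random.Random(seed)
    total_inside = sum(count_inside(n_samples, rng) for _ in range(n_maps))
    total_samples = n_maps * n_samples
    return 4.0 * total_inside / total_samples, total_samples


estimate, total = estimate_pi(3, 100)
print(total, estimate)
```

With only 300 samples the estimate is coarse; raising either argument tightens it at the cost of more work.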
- 2.1. Once that runs successfully, start writing the Python mapper and reducer
- Usage notes printed by the official Hadoop Streaming jar:

      Usage: $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar [options]
      Options:
        -input          DFS input file(s) for the Map step.
        -output         DFS output directory for the Reduce step.
        -mapper         Optional. Command to be run as mapper.
        -combiner       Optional. Command to be run as combiner.
        -reducer        Optional. Command to be run as reducer.
        -file           Optional. File/dir to be shipped in the Job jar file.
                        Deprecated. Use generic option "-files" instead.
        -inputformat    Optional. The input format class.
        -outputformat   Optional. The output format class.
        -partitioner    Optional. The partitioner class.
        -numReduceTasks Optional. Number of reduce tasks.
        -inputreader    Optional. Input recordreader spec.
        -cmdenv <n>=<v> Optional. Pass env.var to streaming commands.
        -mapdebug       Optional. To run this script when a map task fails.
        -reducedebug    Optional. To run this script when a reduce task fails.
        -io             Optional. Format to use for input to and output from mapper/reducer commands
        -lazyOutput     Optional. Lazily create Output.
        -background     Optional. Submit the job and don't wait till it completes.
        -verbose        Optional. Print verbose output.
        -info           Optional. Print detailed usage.
        -help           Optional. Print help message.

      Generic options supported are
        -conf      specify an application configuration file
        -D         use value for given property
        -fs        specify a namenode
        -jt        specify a ResourceManager
        -files     specify comma separated files to be copied to the map reduce cluster
        -libjars   specify comma separated jar files to include in the classpath.
        -archives  specify comma separated archives to be unarchived on the compute machines.

      The general command line syntax is
      bin/hadoop command [genericOptions] [commandOptions]

      Usage tips:
      In -input: globbing on <path> is supported and can have multiple -input

      Default Map input format: a line is a record in UTF-8 the key part ends at first
        TAB, the rest of the line is the value

      To pass a Custom input format:
        -inputformat package.MyInputFormat

      Similarly, to pass a custom output format:
        -outputformat package.MyOutputFormat

      The files with extensions .class and .jar/.zip, specified for the -file
        argument[s], end up in "classes" and "lib" directories respectively inside
        the working directory when the mapper and reducer are run. All other files
        specified for the -file argument[s] end up in the working directory when the
        mapper and reducer are run. The location of this working directory is
        unspecified.

      To set the number of reduce tasks (num. of output files) as, say 10:
        Use -numReduceTasks 10
      To skip the sort/combine/shuffle/sort/reduce step:
        Use -numReduceTasks 0
        Map output then becomes a 'side-effect output' rather than a reduce input.
        This speeds up processing. This also feels more like "in-place" processing
        because the input filename and the map input order are preserved.
        This is equivalent to -reducer NONE

      To speed up the last maps:
        -D mapreduce.map.speculative=true
      To speed up the last reduces:
        -D mapreduce.reduce.speculative=true
      To name the job (appears in the JobTracker Web UI):
        -D mapreduce.job.name='My Job'
      To change the local temp directory:
        -D dfs.data.dir=/tmp/dfs
        -D stream.tmpdir=/tmp/streaming
      Additional local temp directories with -jt local:
        -D mapreduce.cluster.local.dir=/tmp/local
        -D mapreduce.jobtracker.system.dir=/tmp/system
        -D mapreduce.cluster.temp.dir=/tmp/temp
      To treat tasks with non-zero exit status as SUCCEDED:
        -D stream.non.zero.exit.is.failure=false
      Use a custom hadoop streaming build along with standard hadoop install:
        $HADOOP_PREFIX/bin/hadoop jar /path/my-hadoop-streaming.jar [...]\
          [...] -D stream.shipped.hadoopstreaming=/path/my-hadoop-streaming.jar

      For more details about jobconf parameters see:
        http://wiki.apache.org/hadoop/JobConfFile

      To set an environement variable in a streaming command:
        -cmdenv EXAMPLE_DIR=/home/example/dictionaries/

      Shortcut:
        setenv HSTREAMING "$HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar"

      Example: $HSTREAMING -mapper "/usr/local/bin/perl5 filter.pl"
                 -file /local/filter.pl -input "/logs/0604*/*" [...]
        Ships a script, invokes the non-shipped perl interpreter. Shipped files go to
        the working directory so filter.pl is found by perl. Input files are all the
        daily logs for days in month 2006-04
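The usage tip about the default input format ("the key part ends at first TAB, the rest of the line is the value") is the rule the Python scripts rely on. A minimal sketch of that split follows; the helper name `split_key_value` is hypothetical, and representing a missing value as an empty string is an assumption made for the example.

```python
def split_key_value(line, separator="\t"):
    """Split one line into (key, value) at the first separator."""
    line = line.rstrip("\n")
    if separator in line:
        key, value = line.split(separator, 1)
        return key, value
    # No TAB at all: the whole line is treated as the key.
    return line, ""


print(split_key_value("apple\ttest_1.txt"))  # ('apple', 'test_1.txt')
print(split_key_value("apple\ta\tb"))        # ('apple', 'a\tb')
print(split_key_value("no tab here"))        # ('no tab here', '')
```

Only the first TAB matters, which is why a value may itself contain further TAB characters.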
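- Data preparation (a books.json dataset I found online)
- 1. The data is in JSON format; tokenize the content of each txt
- 2. Extract every word and output one line per word, listing the text or texts that contain it, for example:

      Data:
      ["test_1.txt", "[ apple pipe ]"]
      ["test_2.txt", "[ apple company ]"]
      Result:
      apple ["test_1.txt", "test_2.txt"]
      pipe ["test_1.txt"]
      company ["test_2.txt"]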
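      #!/usr/bin/env python
      # -*- coding: UTF-8 -*-
      """
      mapper.py
      Created on 2017-10-30
      @author: Leo
      """
      # Standard library
      import sys
      import json

      for line in sys.stdin:
          line = line.strip()
          if not line:
              continue  # skip blank lines, which json.loads would reject
          # Each record is a JSON array: [file_name, text]
          record = json.loads(line)
          file_name = record[0]
          value = record[1]
          # Emit one "word<TAB>file_name" pair per whitespace-separated token
          words = value.split()
          for word in words:
              print("%s\t%s" % (word, file_name))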
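      #!/usr/bin/env python
      # -*- coding: UTF-8 -*-
      """
      reducer.py
      Created on 2017-10-30
      @author: Leo
      """
      # Standard library
      import sys

      media = {}          # word -> every file name seen for it (duplicates included)
      word_in_media = {}  # word -> de-duplicated, sorted list of file names

      for line in sys.stdin:
          line = line.strip()
          if not line:
              continue  # skip blank lines
          (word, file_name) = line.split('\t', 1)
          media.setdefault(word, [])
          media[word].append(file_name)

      for word in media:
          # sorted() keeps the output order stable from run to run
          word_in_media.setdefault(word, sorted(set(media[word])))

      for word in word_in_media:
          print("%s\t%s" % (word, word_in_media[word]))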
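Before submitting anything to Hadoop, the map, sort and reduce phases can be imitated in a single process to see whether the logic produces the expected index. The sketch below is not the streaming runtime; it is a local stand-in, and it assumes the surrounding "[ ... ]" in the sample text should be dropped so that the result matches the example (the mapper itself does not do this). The names `map_line` and `run_local` are illustrative.

```python
import json
from itertools import groupby


def map_line(line):
    """Mirror of the mapper: yield (word, file_name) for one JSON record."""
    file_name, text = json.loads(line)
    # Assumption: strip the surrounding "[ ... ]" seen in the sample data.
    for word in text.strip("[] ").split():
        yield word, file_name


def run_local(lines):
    """Map, sort by key (the 'shuffle'), then reduce to word -> sorted file names."""
    pairs = [pair for line in lines for pair in map_line(line)]
    pairs.sort(key=lambda kv: kv[0])
    result = {}
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        result[word] = sorted({file_name for _, file_name in group})
    return result


sample = [
    '["test_1.txt", "[ apple pipe ]"]',
    '["test_2.txt", "[ apple company ]"]',
]
for word, files in sorted(run_local(sample).items()):
    print("%s\t%s" % (word, files))
```

The sort in the middle plays the role of the shuffle: it is what guarantees that all pairs for one word reach the reduce step together.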
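- 1. Upload command

      rem If you have not created a directory on HDFS yet, create one first
      rem This is how I created mine; create yours however you like
      hdfs dfs -mkdir -p /user/Leo/input
      rem Upload from the local machine to HDFS
      hdfs dfs -copyFromLocal <absolute path>\books.json /user/Leo/input/
      rem Open the HDFS page at localhost:50070 and check that the file exists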
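- 2. Delete the output folder

      rem If a MapReduce run fails, delete the output folder before running again after the fix; the job will not start while the output directory exists
      hdfs dfs -rm -r /user/Leo/output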
- 3. Turn off HDFS safe mode

      rem Use this when an abnormal operation has put HDFS into safe mode
      hdfs dfsadmin -safemode leave
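- 4. Run the hadoop streaming command (don't forget the hyphens! don't forget the hyphens!)

      # The command is long, so it is shown with Linux-style line continuations (in cmd, use ^ instead of \)
      hadoop jar D:/bigdata/hadoop-2.7.4/share/hadoop/tools/lib/hadoop-streaming-2.7.4.jar \
        -D stream.non.zero.exit.is.failure=false \
        -input /user/Leo/input/books.json \
        -output /user/Leo/output \
        -mapper "python mapper.py" \
        -reducer "python reducer.py" \
        -file C:/Users/Administrator/Desktop/MingDong_Work/Work_2/mapper.py \
        -file C:/Users/Administrator/Desktop/MingDong_Work/Work_2/reducer.py

- Explanation:
- 1. jar is followed by the path to the jar file. The official recommendation is an environment variable plus a relative path; I used an absolute path here for demonstration
- 2. -D stream.non.zero.exit.is.failure=false: by default, a mapper or reducer that exits with a non-zero status is treated as failed. This setting skips that check, so real script errors can go unnoticed
- 3. -input: the input file on HDFS
- 4. -output: where the results are stored once the MapReduce job finishes
- 5. -mapper: the script or command to run as the mapper
- 6. -reducer: the script or command to run as the reducer
- 7. -file: the local location of a script to ship with the job (-file is the old, deprecated form and the generic option -files replaces it; I used -file here because it reads more clearly)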
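Since a missing hyphen or a broken line continuation is the usual way this command goes wrong, one option is to assemble it as an argument list in Python. This is only a sketch: it reuses the flags and paths from the command above, the function name `build_streaming_command` is hypothetical, and nothing is submitted to Hadoop here.

```python
def build_streaming_command(streaming_jar, hdfs_input, hdfs_output, scripts_dir):
    """Return the streaming command as an argument list."""
    return [
        "hadoop", "jar", streaming_jar,
        "-D", "stream.non.zero.exit.is.failure=false",
        "-input", hdfs_input,
        "-output", hdfs_output,
        "-mapper", "python mapper.py",
        "-reducer", "python reducer.py",
        "-file", scripts_dir + "/mapper.py",
        "-file", scripts_dir + "/reducer.py",
    ]


command = build_streaming_command(
    "D:/bigdata/hadoop-2.7.4/share/hadoop/tools/lib/hadoop-streaming-2.7.4.jar",
    "/user/Leo/input/books.json",
    "/user/Leo/output",
    "C:/Users/Administrator/Desktop/MingDong_Work/Work_2",
)
# Print the command for inspection instead of running it.
print(" ".join(command))
```

The list could then be handed to the standard `subprocess` module to launch the job; whether `shell=True` is needed for `hadoop.cmd` to be found on Windows is something to verify on your own machine.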
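- 5. After the job finishes, a file named part-00000 appears under the output path you specified on HDFS. The results are stored in it; you can download it directly from the web page or fetch it with a command.

      hdfs dfs -get /user/Leo/output/part-00000 <local path>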
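Once part-00000 is on the local disk, each line has the shape written by the reducer: a word, a TAB, and a printed Python list. A small sketch for reading it back, assuming exactly that shape; the function names and the sample lines are illustrative.

```python
import ast


def parse_result_line(line):
    """Split 'word<TAB>[...]' and turn the printed list back into a Python list."""
    word, printed_list = line.strip().split("\t", 1)
    return word, ast.literal_eval(printed_list)


def load_index(lines):
    """Build a word -> list of file names dictionary from output lines."""
    return dict(parse_result_line(line) for line in lines if line.strip())


sample_lines = [
    "apple\t['test_1.txt', 'test_2.txt']\n",
    "pipe\t['test_1.txt']\n",
]
index = load_index(sample_lines)
print(index["apple"])  # prints ['test_1.txt', 'test_2.txt']
```

For the real file, passing an open file object to `load_index` works the same way, since it only iterates over lines. `ast.literal_eval` is used rather than `eval` so that only literals are accepted.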
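- Overall, Hadoop Streaming is easy to pick up; the main difficulty lies in understanding the MapReduce model.
- You need to understand the following points:
- The MapReduce (that is, Map-Shuffle-Reduce) workflow
- How Hadoop Streaming splits lines into key and value
- Your own business requirements
- The characteristics of the scripting language you use, so the scripts are written to suit it.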
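Advantages: (convenient and quick)
- Any language that supports stdin and stdout will do (Unix style)
- Flexible; it is mostly used for ad hoc tasks, with no need to change the project's code structure
- Local debugging
Disadvantages: (mostly performance issues)
- Precisely because everything passes through stdin and stdout, data types inevitably have to be converted as data is exchanged, which adds to the execution time.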
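The conversion cost is easy to see in miniature: whatever the mapper emits becomes text, and the reducer has to parse it back before it can compute anything. The sketch below uses a made-up word count rather than the file-name index from this tutorial, and the function names are hypothetical.

```python
def encode_pair(key, value):
    """What a mapper does: turn a (key, value) pair into one text line."""
    return "%s\t%s" % (key, value)


def decode_pair(line):
    """What a reducer does: split the line and convert the value back to int."""
    key, value = line.rstrip("\n").split("\t", 1)
    return key, int(value)


lines = [encode_pair("apple", 2), encode_pair("apple", 3)]
# On the wire the count is the string "2", not the integer 2.
print(lines[0].split("\t")[1] == "2")            # prints True
print(sum(decode_pair(line)[1] for line in lines))  # prints 5
```

Every record pays this encode and decode step in both directions, which is where the extra time goes.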
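- A strange error that shows up in the Windows environment
- Solution ->
- By default, administrators can create symbolic links, so you can launch Hadoop applications from an administrator command prompt
- Alternatively, change the user policy as follows:
- Win+R -> gpedit.msc
- Computer Configuration -> Windows Settings -> Security Settings -> Local Policies -> User Rights Assignment -> Create symbolic links
- Add your user, then restart or sign out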
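To find out whether the current account is already allowed to create symbolic links, without starting Hadoop at all, a quick probe along these lines may help. It is a sketch that simply attempts to create a link in a temporary directory; the function name `can_create_symlink` is made up.

```python
import os
import tempfile


def can_create_symlink():
    """Try to create a symbolic link in a temp directory; True if it works."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w") as handle:
            handle.write("x")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            # Typically raised on Windows when the privilege is missing.
            return False
        return os.path.islink(link)


print(can_create_symlink())
```

If this prints False in a normal cmd window and True in an administrator one, the policy change above is the likely fix.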