Feedback
- If anything about deployment or usage is unclear, feel free to contact me:
- WeChat: Leo-sunhailin
- QQ: 379978424
Environment
- OS: Windows 10 (64-bit) <-- must be 64-bit
- Linux is not covered here, since deployment there is simpler than on Windows
- Java version: Java 1.8.0_144
- Hadoop version: Apache Hadoop 2.7.4
- (the rest of this tutorial uses 2.7.4 as the running example)
- Python version: Python 3.4.4
Downloads
- Hadoop download:
- Prefer a domestic mirror if you can; historical releases are only available from the official archive.
- 1. Aliyun mirror:
- 2. Tsinghua mirror:
- 3. Official:
- Winutils (Linux users can skip this):
- 1. The winutils version must match your Hadoop version
- 2. If you don't download it, you'll have to find a winutils package someone else compiled online (which is prone to bugs)
- Java and Python downloads: search for them yourself
- JDK 1.8
- Python 3.4.4
Deployment and testing
- Setting up the Python environment is not covered here
- Step 0 (Java environment variables)
- The single most important point: on Windows, the Java path must not contain any spaces.
- After installing Java, copy the installation directory out to a path with no spaces at all.
- Configure the Java environment variables
- Step 1 (Hadoop setup)
- Note: the following is a single-machine, single-node deployment
- 1.1. Unpack the archive into a directory of your choice
- I created a directory on my D: drive: D:/bigdata/
- After unpacking, the Hadoop path is D:/bigdata/hadoop-2.7.4
- 1.2. Go into etc\hadoop\ under the Hadoop root directory
- 1.3. Edit core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml (mapred-site.xml originally carries a .template suffix; rename it to drop the suffix)
- core-site.xml:

```xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
        <description>HDFS URI: filesystem://namenode-host:port</description>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/D:/bigdata/hadoop-2.7.4/workplace/tmp</value>
        <description>Local Hadoop temp directory on the namenode</description>
    </property>
</configuration>
```

- hdfs-site.xml:

```xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>Replica count; the default is 3, and it should not exceed the number of datanodes</description>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/D:/bigdata/hadoop-2.7.4/workplace/name</value>
        <description>Where the namenode stores the HDFS namespace metadata</description>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/D:/bigdata/hadoop-2.7.4/workplace/data</value>
        <description>Physical storage location of data blocks on the datanode</description>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
        <description>Enable the WebHDFS interface</description>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
        <description>Disable HDFS permission checks</description>
    </property>
</configuration>
```

- yarn-site.xml:

```xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1536</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>4096</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>2</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>
```

- mapred-site.xml:

```xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>mapreduce.jobtracker.http.address</name>
        <value>localhost:50030</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>localhost:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>localhost:19888</value>
    </property>
    <property>
        <name>mapred.job.tracker</name>
        <value>http://localhost:9001</value>
    </property>
</configuration>
```

- 1.4. Back in the Hadoop root directory, create the folders referenced in hdfs-site.xml and core-site.xml
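The folder creation in step 1.4 can also be scripted. A minimal sketch (the `make_workplace_dirs` helper is my own illustration, not part of the original setup; it just creates the tmp/name/data directories the configs point at):

```python
import os


def make_workplace_dirs(base):
    """Create the tmp/name/data folders that core-site.xml and
    hdfs-site.xml point at under the workplace directory."""
    paths = [os.path.join(base, sub) for sub in ("tmp", "name", "data")]
    for p in paths:
        os.makedirs(p, exist_ok=True)  # no-op if the folder already exists
    return paths


# e.g. make_workplace_dirs(r"D:/bigdata/hadoop-2.7.4/workplace")
```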
- 1.5. Configure environment variables:
- 1.5.1. Add the system variable HADOOP_CONF_DIR:

```
D:\bigdata\hadoop-2.7.4\etc\hadoop\
```

- 1.5.2. Add the system variable HADOOP_HOME:

```
D:\bigdata\hadoop-2.7.4
```

- 1.5.3. Append to Path:

```
D:\bigdata\hadoop-2.7.4\bin
```
- 1.6. Format the namenode:

```shell
hadoop namenode -format
```

- 1.7. (Once 1.6 completes without errors) start Hadoop:

```shell
cd /d D:\bigdata\hadoop-2.7.4\sbin
dir

## Recommended startup sequence
start-dfs.cmd
start-yarn.cmd

## The blunt way
start-all.cmd
```

- 1.8. (Once all of the above ran without errors) open a new cmd window:

```shell
jps
# check that each component started and holds a process ID
```
- 1.9. Test the web UIs (move on once they load without errors):
- 2.0. Test whether the Hadoop MapReduce examples run:

```shell
# Lists the available example programs
hadoop jar D:\bigdata\hadoop-2.7.4\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.4.jar -info

# Use the most common one, the pi test
# (3 map tasks, 100 samples per map; their product is the total sample count)
hadoop jar D:\bigdata\hadoop-2.7.4\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.4.jar pi 3 100
```
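For intuition, the pi example throws (maps × samples) points at the unit square and counts how many land inside the quarter circle. Below is a plain Monte Carlo sketch of that idea (the real Hadoop example uses a quasi-Monte Carlo Halton sequence, so this is only an illustration, and `estimate_pi` is my own helper):

```python
import random


def estimate_pi(num_maps, samples_per_map, seed=42):
    """Estimate pi from num_maps * samples_per_map random points,
    mirroring how the example multiplies its two arguments into a
    total sample count."""
    rng = random.Random(seed)
    total = num_maps * samples_per_map
    inside = 0
    for _ in range(total):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1  # point fell inside the quarter circle
    return 4.0 * inside / total
```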
- 2.1. Once that runs successfully, start writing the Python mapper and reducer
Code examples
- Official usage notes printed by the Hadoop Streaming jar:
```
Usage: $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar [options]
Options:
  -input            DFS input file(s) for the Map step.
  -output           DFS output directory for the Reduce step.
  -mapper           Optional. Command to be run as mapper.
  -combiner         Optional. Command to be run as combiner.
  -reducer          Optional. Command to be run as reducer.
  -file             Optional. File/dir to be shipped in the Job jar file.
                    Deprecated. Use generic option "-files" instead.
  -inputformat      Optional. The input format class.
  -outputformat     Optional. The output format class.
  -partitioner      Optional. The partitioner class.
  -numReduceTasks   Optional. Number of reduce tasks.
  -inputreader      Optional. Input recordreader spec.
  -cmdenv           Optional. Pass env.var to streaming commands.
  -mapdebug         Optional. To run this script when a map task fails.
  -reducedebug      Optional. To run this script when a reduce task fails.
  -io               Optional. Format to use for input to and output from
                    mapper/reducer commands
  -lazyOutput       Optional. Lazily create Output.
  -background       Optional. Submit the job and don't wait till it completes.
  -verbose          Optional. Print verbose output.
  -info             Optional. Print detailed usage.
  -help             Optional. Print help message.

Generic options supported are
  -conf       specify an application configuration file
  -D          use value for given property
  -fs         specify a namenode
  -jt         specify a ResourceManager
  -files      specify comma separated files to be copied to the map reduce cluster
  -libjars    specify comma separated jar files to include in the classpath.
  -archives   specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
  bin/hadoop command [genericOptions] [commandOptions]

Usage tips:
In -input: globbing is supported and can have multiple -input
Default Map input format: a line is a record in UTF-8; the key part ends at
  first TAB, the rest of the line is the value
To pass a Custom input format: -inputformat package.MyInputFormat
Similarly, to pass a custom output format: -outputformat package.MyOutputFormat
The files with extensions .class and .jar/.zip, specified for the -file
  argument[s], end up in "classes" and "lib" directories respectively inside
  the working directory when the mapper and reducer are run. All other files
  specified for the -file argument[s] end up in the working directory when the
  mapper and reducer are run. The location of this working directory is
  unspecified.
To set the number of reduce tasks (num. of output files) as, say 10:
  Use -numReduceTasks 10
To skip the sort/combine/shuffle/sort/reduce step:
  Use -numReduceTasks 0
  Map output then becomes a 'side-effect output' rather than a reduce input.
  This speeds up processing. This also feels more like "in-place" processing
  because the input filename and the map input order are preserved.
  This is equivalent to -reducer NONE
To speed up the last maps: -D mapreduce.map.speculative=true
To speed up the last reduces: -D mapreduce.reduce.speculative=true
To name the job (appears in the JobTracker Web UI): -D mapreduce.job.name='My Job'
To change the local temp directory:
  -D dfs.data.dir=/tmp/dfs -D stream.tmpdir=/tmp/streaming
Additional local temp directories with -jt local:
  -D mapreduce.cluster.local.dir=/tmp/local
  -D mapreduce.jobtracker.system.dir=/tmp/system
  -D mapreduce.cluster.temp.dir=/tmp/temp
To treat tasks with non-zero exit status as SUCCEDED:
  -D stream.non.zero.exit.is.failure=false
Use a custom hadoop streaming build along with standard hadoop install:
  $HADOOP_PREFIX/bin/hadoop jar /path/my-hadoop-streaming.jar [...] \
    [...] -D stream.shipped.hadoopstreaming=/path/my-hadoop-streaming.jar
For more details about jobconf parameters see:
  http://wiki.apache.org/hadoop/JobConfFile
To set an environement variable in a streaming command:
  -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
Shortcut:
  setenv HSTREAMING "$HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar"
Example:
  $HSTREAMING -mapper "/usr/local/bin/perl5 filter.pl" -file /local/filter.pl \
    -input "/logs/0604*/*" [...]
  Ships a script, invokes the non-shipped perl interpreter. Shipped files go to
  the working directory so filter.pl is found by perl. Input files are all the
  daily logs for days in month 2006-04
```
- Data preparation (a books.json dataset I found online; one JSON record per line)
["milton-paradise.txt", "[ Paradise Lost by John Milton 1667 ] Book I Of Man ' s first disobedience , and the fruit Of that forbidden tree whose mortal taste Brought death into the World , and all our woe , With loss of Eden , till one greater Man Restore us , and regain the blissful seat , Sing , Heavenly Muse , that , on the secret top Of Oreb , or of Sinai , didst inspire That shepherd who first taught the chosen seed In the beginning how the heavens and earth Rose out of Chaos : or , if Sion hill Delight thee more , and Siloa ' s brook that flowed Fast by the oracle of God , I thence Invoke thy aid to my adventurous song , That with no middle flight intends to soar Above th ' Aonian mount , while it pursues Things unattempted yet in prose or rhyme ."]["edgeworth-parents.txt", "[ The Parent ' s Assistant , by Maria Edgeworth ] THE ORPHANS . Near the ruins of the castle of Rossmore , in Ireland , is a small cabin , in which there once lived a widow and her four children . As long as she was able to work , she was very industrious , and was accounted the best spinner in the parish ; but she overworked herself at last , and fell ill , so that she could not sit to her wheel as she used to do , and was obliged to give it up to her eldest daughter , Mary ."]["austen-emma.txt", "[ Emma by Jane Austen 1816 ] VOLUME I CHAPTER I Emma Woodhouse , handsome , clever , and rich , with a comfortable home and happy disposition , seemed to unite some of the best blessings of existence ; and had lived nearly twenty - one years in the world with very little to distress or vex her . She was the youngest of the two daughters of a most affectionate , indulgent father ; and had , in consequence of her sister ' s marriage , been mistress of his house from a very early period . 
Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses ; and her place had been supplied by an excellent woman as governess , who had fallen little short of a mother in affection ."]["chesterton-ball.txt", "[ The Ball and The Cross by G . K . Chesterton 1909 ] I . A DISCUSSION SOMEWHAT IN THE AIR The flying ship of Professor Lucifer sang through the skies like a silver arrow ; the bleak white steel of it , gleaming in the bleak blue emptiness of the evening . That it was far above the earth was no expression for it ; to the two men in it , it seemed to be far above the stars . The professor had himself invented the flying machine , and had also invented nearly everything in it ."]["bible-kjv.txt", "[ The King James Bible ] The Old Testament of the King James Bible The First Book of Moses : Called Genesis 1 : 1 In the beginning God created the heaven and the earth . 1 : 2 And the earth was without form , and void ; and darkness was upon the face of the deep . And the Spirit of God moved upon the face of the waters . 1 : 3 And God said , Let there be light : and there was light . 1 : 4 And God saw the light , that it was good : and God divided the light from the darkness . 1 : 5 And God called the light Day , and the darkness he called Night . And the evening and the morning were the first day ."]["chesterton-thursday.txt", "[ The Man Who Was Thursday by G . K . Chesterton 1908 ] To Edmund Clerihew Bentley A cloud was on the mind of men , and wailing went the weather , Yea , a sick cloud upon the soul when we were boys together . Science announced nonentity and art admired decay ; The world was old and ended : but you and I were gay ; Round us in antic order their crippled vices came -- Lust that had lost its laughter , fear that had lost its shame . Like the white lock of Whistler , that lit our aimless gloom , Men showed their own white feather as proudly as a plume . 
Life was a fly that faded , and death a drone that stung ; The world was very old indeed when you and I were young ."]["blake-poems.txt", "[ Poems by William Blake 1789 ] SONGS OF INNOCENCE AND OF EXPERIENCE and THE BOOK of THEL SONGS OF INNOCENCE INTRODUCTION Piping down the valleys wild , Piping songs of pleasant glee , On a cloud I saw a child , And he laughing said to me : \" Pipe a song about a Lamb !\" So I piped with merry cheer . \" Piper , pipe that song again ;\" So I piped : he wept to hear . \" Drop thy pipe , thy happy pipe ; Sing thy songs of happy cheer :!\" So I sang the same again , While he wept with joy to hear . \" Piper , sit thee down and write In a book , that all may read .\" So he vanish ' d from my sight ; And I pluck ' d a hollow reed , And I made a rural pen , And I stain ' d the water clear , And I wrote my happy songs Every child may joy to hear ."]["shakespeare-caesar.txt", "[ The Tragedie of Julius Caesar by William Shakespeare 1599 ] Actus Primus . Scoena Prima . Enter Flauius , Murellus , and certaine Commoners ouer the Stage . Flauius . Hence : home you idle Creatures , get you home : Is this a Holiday ? What , know you not ( Being Mechanicall ) you ought not walke Vpon a labouring day , without the signe Of your Profession ? Speake , what Trade art thou ? Car . Why Sir , a Carpenter Mur . Where is thy Leather Apron , and thy Rule ? What dost thou with thy best Apparrell on ? You sir , what Trade are you ? Cobl . Truely Sir , in respect of a fine Workman , I am but as you would say , a Cobler Mur . But what Trade art thou ? Answer me directly Cob . 
A Trade Sir , that I hope I may vse , with a safe Conscience , which is indeed Sir , a Mender of bad soules Fla ."]["whitman-leaves.txt", "[ Leaves of Grass by Walt Whitman 1855 ] Come , said my soul , Such verses for my Body let us write , ( for we are one ,) That should I after return , Or , long , long hence , in other spheres , There to some group of mates the chants resuming , ( Tallying Earth ' s soil , trees , winds , tumultuous waves ,) Ever with pleas ' d smile I may keep on , Ever and ever yet the verses owning -- as , first , I here and now Signing for Soul and Body , set to them my name , Walt Whitman [ BOOK I . INSCRIPTIONS ] } One ' s - Self I Sing One ' s - self I sing , a simple separate person , Yet utter the word Democratic , the word En - Masse ."]["melville-moby_dick.txt", "[ Moby Dick by Herman Melville 1851 ] ETYMOLOGY . ( Supplied by a Late Consumptive Usher to a Grammar School ) The pale Usher -- threadbare in coat , heart , body , and brain ; I see him now . He was ever dusting his old lexicons and grammars , with a queer handkerchief , mockingly embellished with all the gay flags of all the known nations of the world . He loved to dust his old grammars ; it somehow mildly reminded him of his mortality ."]复制代码
Task requirements:
- 1. The data is in JSON format; tokenize the text content of each txt entry
- 2. Extract every word and output, one line per word, the file(s) it appears in, e.g.:

```
Data:
["test_1.txt", "[ apple pipe ]"]
["test_2.txt", "[ apple company ]"]

Result:
apple ["test_1.txt", "test_2.txt"]
pipe ["test_1.txt"]
company ["test_2.txt"]
```

- Mapper (mapper.py):
#!/usr/bin/env python# -*- coding: UTF-8 -*-"""Created on 2017年10月30日@author: Leo"""# Python内部库import sysimport jsonfor line in sys.stdin: line = line.strip() record = json.loads(line) file_name = record[0] value = record[1] words = value.split() for word in words: print("%s\t%s" % (word, file_name))复制代码
- Reducer (reducer.py):
#!/usr/bin/env python# -*- coding: UTF-8 -*-"""Created on 2017年10月30日@author: Leo"""# Python内部库import sysmedia = {}word_in_media = {}# maps words to their countsfor line in sys.stdin: (word, file_name) = line.strip().split('\t', 1) media.setdefault(word, []) media[word].append(file_name)for word in media: word_in_media.setdefault(word, list(set(media[word])))for word in word_in_media: print("%s\t%s" % (word, word_in_media[word]))复制代码
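Before submitting to the cluster, the pipeline can be dry-run locally; conceptually it is `cat books.json | python mapper.py | sort | python reducer.py`. The in-process sketch below (the `run_local` helper is my own, not from the original) mirrors mapper.py, the shuffle/sort, and reducer.py:

```python
import json
from itertools import groupby


def run_local(json_lines):
    """Simulate mapper -> shuffle/sort -> reducer for the word-to-files task."""
    # Map step: emit one (word, file_name) pair per token, like mapper.py
    pairs = []
    for line in json_lines:
        file_name, text = json.loads(line)
        for word in text.split():
            pairs.append((word, file_name))
    # Shuffle/sort step: Hadoop sorts map output by key between the two phases
    pairs.sort()
    # Reduce step: deduplicate the file list per word, like reducer.py
    return {
        word: sorted({fn for _, fn in group})
        for word, group in groupby(pairs, key=lambda kv: kv[0])
    }
```

Feeding it the two-record example from the task spec maps "apple" to both files and "pipe"/"company" to one file each.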
Uploading books.json to HDFS
- 1. Upload command:

```shell
# If you haven't created a directory on HDFS yet, create one first
# (this is how I created mine; adjust as you like)
hdfs dfs -mkdir -p /user/Leo/input

# Upload from the local filesystem to HDFS
hdfs dfs -copyFromLocal <absolute path>\books.json /user/Leo/input/

# Then open the HDFS web UI at localhost:50070 and check that the file exists
```
- 2. Deleting the output folder:

```shell
# If the MapReduce job fails partway, remember to delete the output
# folder before re-running after you fix the problem
hdfs dfs -rm -r /user/Leo/output
```
- 3. Leaving HDFS safe mode:

```shell
# In case some abnormal operation pushed HDFS into safe mode
hdfs dfsadmin -safemode leave
```
- 4. Run the hadoop streaming command (don't forget the leading dashes!):

```shell
# The command is long, so it is shown with line continuations
hadoop jar D:/bigdata/hadoop-2.7.4/share/hadoop/tools/lib/hadoop-streaming-2.7.4.jar \
    -D stream.non.zero.exit.is.failure=false \
    -input /user/Leo/input/books.json \
    -output /user/Leo/output \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -file C:/Users/Administrator/Desktop/MingDong_Work/Work_2/mapper.py \
    -file C:/Users/Administrator/Desktop/MingDong_Work/Work_2/reducer.py
```

Explanation:
- 1. `jar` is followed by the path to the streaming jar; the official docs suggest an environment variable plus relative path, but an absolute path is used here for clarity
- 2. `-D stream.non.zero.exit.is.failure=false` skips the exit-status check: without it, a mapper or reducer that does not return 0 is treated as failed
- 3. `-input`: the input file on HDFS
- 4. `-output`: where the output is stored on HDFS after the MapReduce job finishes
- 5. `-mapper`: the script or command to run as the mapper
- 6. `-reducer`: the script or command to run as the reducer
- 7. `-file`: ships the code to the job (for multiple files use `-files`; the older per-file `-file` form is shown here for clarity)
- 5. After the job finishes, a file named part-00000 appears under the specified HDFS output path, holding the results. You can download it from the web UI or from the command line:

```shell
hdfs dfs -get /user/Leo/output/part-00000 <local path>
```
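Since hdfs-site.xml enables WebHDFS, the result can also be fetched over HTTP instead of the `hdfs` CLI. A standard-library sketch (the helper names are my own; it assumes the default NameNode web port 50070 used earlier):

```python
from urllib.request import urlopen


def webhdfs_open_url(path, host="localhost", port=50070):
    """Build the WebHDFS OPEN URL for an HDFS file."""
    return "http://%s:%d/webhdfs/v1%s?op=OPEN" % (host, port, path)


def download(hdfs_path, local_file):
    """Stream an HDFS file to disk; urlopen follows the redirect to
    the datanode that actually serves the blocks."""
    with urlopen(webhdfs_open_url(hdfs_path)) as resp, open(local_file, "wb") as out:
        out.write(resp.read())


# e.g. download("/user/Leo/output/part-00000", "part-00000.txt")
```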
Summary
- Overall, Hadoop Streaming is easy to pick up; the hard part is understanding the MapReduce model.
- You need to understand:
- The MapReduce (i.e. Map-Shuffle-Reduce) workflow
- How Hadoop Streaming splits records into key/value pairs
- Your own business requirements
- The characteristics of whichever scripting language the business uses
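The key/value framing mentioned above is simple: by default, Streaming treats everything up to the first TAB as the key and the rest of the line as the value. A sketch of that default (`split_key_value` is my own illustration):

```python
def split_key_value(line, separator="\t"):
    """Default Hadoop Streaming framing: the key ends at the first
    separator (TAB, unless stream.map.output.field.separator says
    otherwise); a line without a separator becomes a key with an
    empty value."""
    key, _, value = line.rstrip("\n").partition(separator)
    return key, value
```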
Pros (convenient and quick):
- Any language that supports stdin/stdout works (Unix style)
- Flexible; typically used for ad-hoc tasks without touching the project's code structure
- Easy to debug locally
Cons (mostly performance):
- Precisely because data is exchanged over stdin/stdout, data types inevitably have to be converted back and forth, which adds execution time.
Addendum
- A strange error that can appear on Windows
- Fix ->
- By default, only administrators may create symbolic links, so you can start the Hadoop processes from an administrator command prompt
- Or change the user rights policy, as follows:
- Win+R -> gpedit.msc
- Computer Configuration -> Windows Settings -> Security Settings -> Local Policies -> User Rights Assignment -> Create symbolic links
- Add your user there, then reboot or log off