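Morphological analysis is my latest obsession, so I installed MeCab in my Cloud9 environment to try it out.
What is MeCab?
It is an open-source morphological analysis engine. I eventually plan to use it from Ruby, and for now I'm still learning it.
MeCab is an open-source morphological analysis engine developed by Taku Kudo (工藤拓), a graduate of the Nara Institute of Science and Technology who is now a software engineer at Google and one of the developers of Google Japanese Input. The name comes from the developer's favorite food, mekabu (和布蕪).
(Source: MeCabとは - Weblio辞書)
Installation steps
Here is the environment I installed it on.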
$ uname -a
Linux XXXXXXXXXXXXXX 4.2.0-c9 #1 SMP Wed Sep 30 16:14:37 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
The actual install command:
$ sudo apt-get install mecab
Here is the terminal log.
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
  libmecab2 mecab-jumandic mecab-utils
The following NEW packages will be installed:
  libmecab2 mecab mecab-jumandic mecab-utils
0 upgraded, 4 newly installed, 0 to remove and 35 not upgraded.
Need to get 13.3 MB of archives.
After this operation, 81.4 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://archive.ubuntu.com/ubuntu/ trusty/universe libmecab2 amd64 0.996-1.1 [244 kB]
Get:2 http://archive.ubuntu.com/ubuntu/ trusty/universe mecab-utils amd64 0.996-1.1 [4154 B]
Get:3 http://archive.ubuntu.com/ubuntu/ trusty/universe mecab-jumandic all 5.1+20070304-3 [13.0 MB]
Get:4 http://archive.ubuntu.com/ubuntu/ trusty/universe mecab amd64 0.996-1.1 [83.2 kB]
Fetched 13.3 MB in 3s (4130 kB/s)
Selecting previously unselected package libmecab2.
(Reading database ... 125961 files and directories currently installed.)
Preparing to unpack .../libmecab2_0.996-1.1_amd64.deb ...
Unpacking libmecab2 (0.996-1.1) ...
Selecting previously unselected package mecab-utils.
Preparing to unpack .../mecab-utils_0.996-1.1_amd64.deb ...
Unpacking mecab-utils (0.996-1.1) ...
Selecting previously unselected package mecab-jumandic.
Preparing to unpack .../mecab-jumandic_5.1+20070304-3_all.deb ...
Unpacking mecab-jumandic (5.1+20070304-3) ...
Selecting previously unselected package mecab.
Preparing to unpack .../mecab_0.996-1.1_amd64.deb ...
Unpacking mecab (0.996-1.1) ...
Processing triggers for man-db (2.6.7.1-1ubuntu1) ...
Setting up libmecab2 (0.996-1.1) ...
Setting up mecab-utils (0.996-1.1) ...
Setting up mecab-jumandic (5.1+20070304-3) ...
Compiling Juman dictionary for Mecab. This takes long time...
reading /usr/share/mecab/dic/juman/unk.def ... 37
emitting double-array: 100% |###########################################|
/usr/share/mecab/dic/juman/model.def is not found. skipped.
reading /usr/share/mecab/dic/juman/Demonstrative.csv ... 76
reading /usr/share/mecab/dic/juman/Prefix.csv ... 75
reading /usr/share/mecab/dic/juman/AuxV.csv ... 421
reading /usr/share/mecab/dic/juman/Suffix.csv ... 1163
reading /usr/share/mecab/dic/juman/Postp.csv ... 104
reading /usr/share/mecab/dic/juman/Noun.hukusi.csv ... 74
reading /usr/share/mecab/dic/juman/Noun.keishiki.csv ... 10
reading /usr/share/mecab/dic/juman/Rengo.csv ... 913
reading /usr/share/mecab/dic/juman/Noun.suusi.csv ... 46
reading /usr/share/mecab/dic/juman/Assert.csv ... 30
reading /usr/share/mecab/dic/juman/Special.csv ... 124
reading /usr/share/mecab/dic/juman/ContentW.csv ... 483161
reading /usr/share/mecab/dic/juman/Noun.koyuu.csv ... 29805
emitting double-array: 100% |###########################################|
reading /usr/share/mecab/dic/juman/matrix.def ... 1509x1509
emitting matrix : 100% |###########################################|
done!
update-alternatives: using /var/lib/mecab/dic/juman to provide /var/lib/mecab/dic/debian (mecab-dictionary) in auto mode
Setting up mecab (0.996-1.1) ...
Compiling Juman dictionary for Mecab. This takes long time...
reading /usr/share/mecab/dic/juman/unk.def ... 37
emitting double-array: 100% |###########################################|
/usr/share/mecab/dic/juman/model.def is not found. skipped.
reading /usr/share/mecab/dic/juman/Demonstrative.csv ... 76
reading /usr/share/mecab/dic/juman/Prefix.csv ... 75
reading /usr/share/mecab/dic/juman/AuxV.csv ... 421
reading /usr/share/mecab/dic/juman/Suffix.csv ... 1163
reading /usr/share/mecab/dic/juman/Postp.csv ... 104
reading /usr/share/mecab/dic/juman/Noun.hukusi.csv ... 74
reading /usr/share/mecab/dic/juman/Noun.keishiki.csv ... 10
reading /usr/share/mecab/dic/juman/Rengo.csv ... 913
reading /usr/share/mecab/dic/juman/Noun.suusi.csv ... 46
reading /usr/share/mecab/dic/juman/Assert.csv ... 30
reading /usr/share/mecab/dic/juman/Special.csv ... 124
reading /usr/share/mecab/dic/juman/ContentW.csv ... 483161
reading /usr/share/mecab/dic/juman/Noun.koyuu.csv ... 29805
emitting double-array: 100% |###########################################|
reading /usr/share/mecab/dic/juman/matrix.def ... 1509x1509
emitting matrix : 100% |###########################################|
done!
Processing triggers for libc-bin (2.19-0ubuntu6.6) ...
Right. It looks like the install went through, so let's try it out straight away.
$ mecab あああああああああああああああああああああああああああああああああああああああああああああああああああああああ ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �����*,*,*,*,* �� ����*,*,*,*,* ����*,*,*,*,* �ああああああああああああああああ ����*,*,*,*,* EOS
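!?!?!?!??!?!? Garbled text
What on earth?
After a lot of digging, it turns out this happens when the dictionary's character encoding does not match the encoding of the input text.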
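One way to check for this kind of mismatch is to compare the charset the active dictionary reports with the terminal's own encoding. The sketch below assumes `mecab -D` (which prints dictionary info, including a `charset:` line) and `locale charmap` are available on the machine. `normalize_charset` and `charsets_match` are helper names I made up for illustration; they are not part of MeCab.

```shell
#!/bin/sh
# Normalize a charset name so "utf-8", "UTF8" and "UTF-8" compare equal.
normalize_charset() {
  printf '%s' "$1" | tr -d '_-' | tr '[:lower:]' '[:upper:]'
}

# Hypothetical helper: succeeds when two charset names mean the same encoding.
charsets_match() {
  [ "$(normalize_charset "$1")" = "$(normalize_charset "$2")" ]
}

# Only query mecab when it is actually installed.
if command -v mecab >/dev/null 2>&1; then
  # "mecab -D" prints lines such as "charset:<TAB>UTF8"; take the value field.
  dict_charset=$(mecab -D | awk '/^charset/ { print $2; exit }')
  # Fall back to "unknown" if the locale command is missing.
  term_charset=$(locale charmap 2>/dev/null || echo unknown)
  if charsets_match "$dict_charset" "$term_charset"; then
    echo "OK: dictionary and terminal both use $term_charset"
  else
    echo "Mismatch: dictionary=$dict_charset terminal=$term_charset"
  fi
fi
```

If this reports a mismatch, the dictionary was compiled for a different encoding than the one the terminal is sending, which would be consistent with the garbled output.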
Adding a dictionary
So I added a UTF-8 dictionary.
$ sudo apt-get install mecab-ipadic-utf8
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
  mecab-ipadic
The following NEW packages will be installed:
  mecab-ipadic mecab-ipadic-utf8
0 upgraded, 2 newly installed, 0 to remove and 35 not upgraded.
Need to get 12.1 MB of archives.
After this operation, 54.4 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://archive.ubuntu.com/ubuntu/ trusty/universe mecab-ipadic all 2.7.0-20070801+main-1 [12.1 MB]
Get:2 http://archive.ubuntu.com/ubuntu/ trusty/universe mecab-ipadic-utf8 all 2.7.0-20070801+main-1 [3522 B]
Fetched 12.1 MB in 2s (4455 kB/s)
Selecting previously unselected package mecab-ipadic.
(Reading database ... 126038 files and directories currently installed.)
Preparing to unpack .../mecab-ipadic_2.7.0-20070801+main-1_all.deb ...
Unpacking mecab-ipadic (2.7.0-20070801+main-1) ...
Selecting previously unselected package mecab-ipadic-utf8.
Preparing to unpack .../mecab-ipadic-utf8_2.7.0-20070801+main-1_all.deb ...
Unpacking mecab-ipadic-utf8 (2.7.0-20070801+main-1) ...
Setting up mecab-ipadic (2.7.0-20070801+main-1) ...
Compiling IPA dictionary for Mecab. This takes long time...
reading /usr/share/mecab/dic/ipadic/unk.def ... 40
emitting double-array: 100% |###########################################|
/usr/share/mecab/dic/ipadic/model.def is not found. skipped.
reading /usr/share/mecab/dic/ipadic/Others.csv ... 2
reading /usr/share/mecab/dic/ipadic/Prefix.csv ... 221
reading /usr/share/mecab/dic/ipadic/Suffix.csv ... 1393
reading /usr/share/mecab/dic/ipadic/Auxil.csv ... 199
reading /usr/share/mecab/dic/ipadic/Filler.csv ... 19
reading /usr/share/mecab/dic/ipadic/Noun.adjv.csv ... 3328
reading /usr/share/mecab/dic/ipadic/Conjunction.csv ... 171
reading /usr/share/mecab/dic/ipadic/Postp.csv ... 146
reading /usr/share/mecab/dic/ipadic/Noun.adverbal.csv ... 795
reading /usr/share/mecab/dic/ipadic/Noun.number.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.org.csv ... 16668
reading /usr/share/mecab/dic/ipadic/Adnominal.csv ... 135
reading /usr/share/mecab/dic/ipadic/Noun.others.csv ... 151
reading /usr/share/mecab/dic/ipadic/Noun.proper.csv ... 27327
reading /usr/share/mecab/dic/ipadic/Interjection.csv ... 252
reading /usr/share/mecab/dic/ipadic/Adverb.csv ... 3032
reading /usr/share/mecab/dic/ipadic/Noun.demonst.csv ... 120
reading /usr/share/mecab/dic/ipadic/Adj.csv ... 27210
reading /usr/share/mecab/dic/ipadic/Postp-col.csv ... 91
reading /usr/share/mecab/dic/ipadic/Noun.nai.csv ... 42
reading /usr/share/mecab/dic/ipadic/Symbol.csv ... 208
reading /usr/share/mecab/dic/ipadic/Verb.csv ... 130750
reading /usr/share/mecab/dic/ipadic/Noun.verbal.csv ... 12146
reading /usr/share/mecab/dic/ipadic/Noun.place.csv ... 72999
reading /usr/share/mecab/dic/ipadic/Noun.name.csv ... 34202
reading /usr/share/mecab/dic/ipadic/Noun.csv ... 60477
emitting double-array: 100% |###########################################|
reading /usr/share/mecab/dic/ipadic/matrix.def ... 1316x1316
emitting matrix : 100% |###########################################|
done!
update-alternatives: using /var/lib/mecab/dic/ipadic to provide /var/lib/mecab/dic/debian (mecab-dictionary) in auto mode
Setting up mecab-ipadic-utf8 (2.7.0-20070801+main-1) ...
Compiling IPA dictionary for Mecab. This takes long time...
reading /usr/share/mecab/dic/ipadic/unk.def ... 40
emitting double-array: 100% |###########################################|
/usr/share/mecab/dic/ipadic/model.def is not found. skipped.
reading /usr/share/mecab/dic/ipadic/Others.csv ... 2
reading /usr/share/mecab/dic/ipadic/Prefix.csv ... 221
reading /usr/share/mecab/dic/ipadic/Suffix.csv ... 1393
reading /usr/share/mecab/dic/ipadic/Auxil.csv ... 199
reading /usr/share/mecab/dic/ipadic/Filler.csv ... 19
reading /usr/share/mecab/dic/ipadic/Noun.adjv.csv ... 3328
reading /usr/share/mecab/dic/ipadic/Conjunction.csv ... 171
reading /usr/share/mecab/dic/ipadic/Postp.csv ... 146
reading /usr/share/mecab/dic/ipadic/Noun.adverbal.csv ... 795
reading /usr/share/mecab/dic/ipadic/Noun.number.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.org.csv ... 16668
reading /usr/share/mecab/dic/ipadic/Adnominal.csv ... 135
reading /usr/share/mecab/dic/ipadic/Noun.others.csv ... 151
reading /usr/share/mecab/dic/ipadic/Noun.proper.csv ... 27327
reading /usr/share/mecab/dic/ipadic/Interjection.csv ... 252
reading /usr/share/mecab/dic/ipadic/Adverb.csv ... 3032
reading /usr/share/mecab/dic/ipadic/Noun.demonst.csv ... 120
reading /usr/share/mecab/dic/ipadic/Adj.csv ... 27210
reading /usr/share/mecab/dic/ipadic/Postp-col.csv ... 91
reading /usr/share/mecab/dic/ipadic/Noun.nai.csv ... 42
reading /usr/share/mecab/dic/ipadic/Symbol.csv ... 208
reading /usr/share/mecab/dic/ipadic/Verb.csv ... 130750
reading /usr/share/mecab/dic/ipadic/Noun.verbal.csv ... 12146
reading /usr/share/mecab/dic/ipadic/Noun.place.csv ... 72999
reading /usr/share/mecab/dic/ipadic/Noun.name.csv ... 34202
reading /usr/share/mecab/dic/ipadic/Noun.csv ... 60477
emitting double-array: 100% |###########################################|
reading /usr/share/mecab/dic/ipadic/matrix.def ... 1316x1316
emitting matrix : 100% |###########################################|
done!
update-alternatives: using /var/lib/mecab/dic/ipadic-utf8 to provide /var/lib/mecab/dic/debian (mecab-dictionary) in auto mode
OK. Let's try again.
$ mecab
ああいう
ああ	感動詞,*,*,*,*,*,ああ,アア,アー
いう	動詞,自立,*,*,五段・ワ行促音便,基本形,いう,イウ,イウ
EOS
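In MeCab's default output, each line is a surface form, a tab, and a comma-separated feature list; with this dictionary the first feature is the part of speech, and `EOS` marks the end of the sentence. As a rough sketch of how that could be picked apart in the shell, the snippet below hard-codes the sample from the run above so it works without MeCab installed. `extract_pos` is a hypothetical helper name, not something MeCab provides.

```shell
#!/bin/sh
# Sample output copied from the run above (surface and features are tab-separated).
sample=$(printf 'ああ\t感動詞,*,*,*,*,*,ああ,アア,アー\nいう\t動詞,自立,*,*,五段・ワ行促音便,基本形,いう,イウ,イウ\nEOS\n')

# Read MeCab output on stdin and print "surface<TAB>part of speech" per token,
# skipping the EOS marker line.
extract_pos() {
  awk -F'\t' '$1 != "EOS" { split($2, f, ","); print $1 "\t" f[1] }'
}

printf '%s\n' "$sample" | extract_pos
```

With a working install, the same helper could presumably be fed live output, for example `echo ああいう | mecab | extract_pos`.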
Good. The garbled output is gone, so MeCab is now usable.
Summary
To use MeCab in a Cloud9 environment, install the following two packages.
$ sudo apt-get install mecab
$ sudo apt-get install mecab-ipadic-utf8
I haven't tuned the dictionary or anything around it yet, so the accuracy may be poor, but I'll adjust that as I go.
Next, I'd like to try driving this MeCab install from Ruby.