/var/www/yatta47.log

A place for Yatta's logs. Expect mostly short, scrapbook-style posts.

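Installing MeCab in a Cloud9 environment, and what to do when the output is garbled.

Morphological analysis is my latest obsession, so I installed MeCab in my Cloud9 environment to try it out.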

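What is MeCab

It is an open-source morphological analysis engine. I eventually plan to use it from Ruby, and for now I'm still learning it.

MeCab is an open-source morphological analysis engine developed by Taku Kudo, a graduate of the Nara Institute of Science and Technology who is now a software engineer at Google and one of the developers of Google Japanese Input. The name comes from the developer's favorite food, mekabu (和布蕪).
(From "MeCabとは - Weblio辞書")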

Installation steps

Here is the environment I installed it in.

$ uname -a  
Linux XXXXXXXXXXXXXX 4.2.0-c9 #1 SMP Wed Sep 30 16:14:37 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

The actual install command:

$ sudo apt-get install mecab

Here is the log output.

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following extra packages will be installed:
  libmecab2 mecab-jumandic mecab-utils
The following NEW packages will be installed:
  libmecab2 mecab mecab-jumandic mecab-utils
0 upgraded, 4 newly installed, 0 to remove and 35 not upgraded.
Need to get 13.3 MB of archives.
After this operation, 81.4 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://archive.ubuntu.com/ubuntu/ trusty/universe libmecab2 amd64 0.996-1.1 [244 kB]
Get:2 http://archive.ubuntu.com/ubuntu/ trusty/universe mecab-utils amd64 0.996-1.1 [4154 B]
Get:3 http://archive.ubuntu.com/ubuntu/ trusty/universe mecab-jumandic all 5.1+20070304-3 [13.0 MB]
Get:4 http://archive.ubuntu.com/ubuntu/ trusty/universe mecab amd64 0.996-1.1 [83.2 kB]
Fetched 13.3 MB in 3s (4130 kB/s) 
Selecting previously unselected package libmecab2.
(Reading database ... 125961 files and directories currently installed.)
Preparing to unpack .../libmecab2_0.996-1.1_amd64.deb ...
Unpacking libmecab2 (0.996-1.1) ...
Selecting previously unselected package mecab-utils.
Preparing to unpack .../mecab-utils_0.996-1.1_amd64.deb ...
Unpacking mecab-utils (0.996-1.1) ...
Selecting previously unselected package mecab-jumandic.
Preparing to unpack .../mecab-jumandic_5.1+20070304-3_all.deb ...
Unpacking mecab-jumandic (5.1+20070304-3) ...
Selecting previously unselected package mecab.
Preparing to unpack .../mecab_0.996-1.1_amd64.deb ...
Unpacking mecab (0.996-1.1) ...
Processing triggers for man-db (2.6.7.1-1ubuntu1) ...
Setting up libmecab2 (0.996-1.1) ...
Setting up mecab-utils (0.996-1.1) ...
Setting up mecab-jumandic (5.1+20070304-3) ...
Compiling Juman dictionary for Mecab.  This takes long time...
reading /usr/share/mecab/dic/juman/unk.def ... 37
emitting double-array: 100% |###########################################| 
/usr/share/mecab/dic/juman/model.def is not found. skipped.
reading /usr/share/mecab/dic/juman/Demonstrative.csv ... 76
reading /usr/share/mecab/dic/juman/Prefix.csv ... 75
reading /usr/share/mecab/dic/juman/AuxV.csv ... 421
reading /usr/share/mecab/dic/juman/Suffix.csv ... 1163
reading /usr/share/mecab/dic/juman/Postp.csv ... 104
reading /usr/share/mecab/dic/juman/Noun.hukusi.csv ... 74
reading /usr/share/mecab/dic/juman/Noun.keishiki.csv ... 10
reading /usr/share/mecab/dic/juman/Rengo.csv ... 913
reading /usr/share/mecab/dic/juman/Noun.suusi.csv ... 46
reading /usr/share/mecab/dic/juman/Assert.csv ... 30
reading /usr/share/mecab/dic/juman/Special.csv ... 124
reading /usr/share/mecab/dic/juman/ContentW.csv ... 483161
reading /usr/share/mecab/dic/juman/Noun.koyuu.csv ... 29805
emitting double-array: 100% |###########################################| 
reading /usr/share/mecab/dic/juman/matrix.def ... 1509x1509
emitting matrix      : 100% |###########################################| 

done!
update-alternatives: using /var/lib/mecab/dic/juman to provide /var/lib/mecab/dic/debian (mecab-dictionary) in auto mode
Setting up mecab (0.996-1.1) ...
Compiling Juman dictionary for Mecab.  This takes long time...
reading /usr/share/mecab/dic/juman/unk.def ... 37
emitting double-array: 100% |###########################################| 
/usr/share/mecab/dic/juman/model.def is not found. skipped.
reading /usr/share/mecab/dic/juman/Demonstrative.csv ... 76
reading /usr/share/mecab/dic/juman/Prefix.csv ... 75
reading /usr/share/mecab/dic/juman/AuxV.csv ... 421
reading /usr/share/mecab/dic/juman/Suffix.csv ... 1163
reading /usr/share/mecab/dic/juman/Postp.csv ... 104
reading /usr/share/mecab/dic/juman/Noun.hukusi.csv ... 74
reading /usr/share/mecab/dic/juman/Noun.keishiki.csv ... 10
reading /usr/share/mecab/dic/juman/Rengo.csv ... 913
reading /usr/share/mecab/dic/juman/Noun.suusi.csv ... 46
reading /usr/share/mecab/dic/juman/Assert.csv ... 30
reading /usr/share/mecab/dic/juman/Special.csv ... 124
reading /usr/share/mecab/dic/juman/ContentW.csv ... 483161
reading /usr/share/mecab/dic/juman/Noun.koyuu.csv ... 29805
emitting double-array: 100% |###########################################| 
reading /usr/share/mecab/dic/juman/matrix.def ... 1509x1509
emitting matrix      : 100% |###########################################| 

done!
Processing triggers for libc-bin (2.19-0ubuntu6.6) ...

Hmm. It seems to have installed, at least. Let's try it right away.

$ mecab 
あああああああああああああああああああああああああああああああああああああああああああああああああああああああ
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�����*,*,*,*,*
��      ����*,*,*,*,*
����*,*,*,*,*
�ああああああああああああああああ       ����*,*,*,*,*
EOS

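!?!?!?!??!?!? Garbled text

What the heck.

After some digging, it seems this happens when the dictionary's character encoding does not match the encoding of the input.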
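
To see why a mismatch produces garbage, it can help to compare the bytes of the same character in two encodings. This is only a sketch: it assumes `iconv` and `od` are available, and it assumes EUC-JP as the mismatched encoding purely for illustration (this post does not confirm which encoding the first dictionary used). `hex_bytes` and `utf8_a` are helper names invented for this example.

```shell
# hex_bytes: print stdin as space-separated hex bytes (hypothetical helper).
hex_bytes() {
  od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# utf8_a: emit the character "あ" as raw UTF-8 bytes using octal escapes,
# so the result does not depend on how this script file is encoded.
utf8_a() {
  printf '\343\201\202'
}

# The same character is three bytes in UTF-8 but two bytes in EUC-JP.
echo "UTF-8 : $(utf8_a | hex_bytes)"
echo "EUC-JP: $(utf8_a | iconv -f UTF-8 -t EUC-JP | hex_bytes)"
```

A tool that expects one of these byte sequences but receives the other cannot line the bytes up with its dictionary entries, which is consistent with the broken output above.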

Adding a dictionary

So I added a UTF-8 dictionary.

$ sudo apt-get install mecab-ipadic-utf8
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following extra packages will be installed:
  mecab-ipadic
The following NEW packages will be installed:
  mecab-ipadic mecab-ipadic-utf8
0 upgraded, 2 newly installed, 0 to remove and 35 not upgraded.
Need to get 12.1 MB of archives.
After this operation, 54.4 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://archive.ubuntu.com/ubuntu/ trusty/universe mecab-ipadic all 2.7.0-20070801+main-1 [12.1 MB]
Get:2 http://archive.ubuntu.com/ubuntu/ trusty/universe mecab-ipadic-utf8 all 2.7.0-20070801+main-1 [3522 B]
Fetched 12.1 MB in 2s (4455 kB/s)            
Selecting previously unselected package mecab-ipadic.
(Reading database ... 126038 files and directories currently installed.)
Preparing to unpack .../mecab-ipadic_2.7.0-20070801+main-1_all.deb ...
Unpacking mecab-ipadic (2.7.0-20070801+main-1) ...
Selecting previously unselected package mecab-ipadic-utf8.
Preparing to unpack .../mecab-ipadic-utf8_2.7.0-20070801+main-1_all.deb ...
Unpacking mecab-ipadic-utf8 (2.7.0-20070801+main-1) ...
Setting up mecab-ipadic (2.7.0-20070801+main-1) ...
Compiling IPA dictionary for Mecab.  This takes long time...
reading /usr/share/mecab/dic/ipadic/unk.def ... 40
emitting double-array: 100% |###########################################| 
/usr/share/mecab/dic/ipadic/model.def is not found. skipped.
reading /usr/share/mecab/dic/ipadic/Others.csv ... 2
reading /usr/share/mecab/dic/ipadic/Prefix.csv ... 221
reading /usr/share/mecab/dic/ipadic/Suffix.csv ... 1393
reading /usr/share/mecab/dic/ipadic/Auxil.csv ... 199
reading /usr/share/mecab/dic/ipadic/Filler.csv ... 19
reading /usr/share/mecab/dic/ipadic/Noun.adjv.csv ... 3328
reading /usr/share/mecab/dic/ipadic/Conjunction.csv ... 171
reading /usr/share/mecab/dic/ipadic/Postp.csv ... 146
reading /usr/share/mecab/dic/ipadic/Noun.adverbal.csv ... 795
reading /usr/share/mecab/dic/ipadic/Noun.number.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.org.csv ... 16668
reading /usr/share/mecab/dic/ipadic/Adnominal.csv ... 135
reading /usr/share/mecab/dic/ipadic/Noun.others.csv ... 151
reading /usr/share/mecab/dic/ipadic/Noun.proper.csv ... 27327
reading /usr/share/mecab/dic/ipadic/Interjection.csv ... 252
reading /usr/share/mecab/dic/ipadic/Adverb.csv ... 3032
reading /usr/share/mecab/dic/ipadic/Noun.demonst.csv ... 120
reading /usr/share/mecab/dic/ipadic/Adj.csv ... 27210
reading /usr/share/mecab/dic/ipadic/Postp-col.csv ... 91
reading /usr/share/mecab/dic/ipadic/Noun.nai.csv ... 42
reading /usr/share/mecab/dic/ipadic/Symbol.csv ... 208
reading /usr/share/mecab/dic/ipadic/Verb.csv ... 130750
reading /usr/share/mecab/dic/ipadic/Noun.verbal.csv ... 12146
reading /usr/share/mecab/dic/ipadic/Noun.place.csv ... 72999
reading /usr/share/mecab/dic/ipadic/Noun.name.csv ... 34202
reading /usr/share/mecab/dic/ipadic/Noun.csv ... 60477
emitting double-array: 100% |###########################################| 
reading /usr/share/mecab/dic/ipadic/matrix.def ... 1316x1316
emitting matrix      : 100% |###########################################| 

done!
update-alternatives: using /var/lib/mecab/dic/ipadic to provide /var/lib/mecab/dic/debian (mecab-dictionary) in auto mode
Setting up mecab-ipadic-utf8 (2.7.0-20070801+main-1) ...
Compiling IPA dictionary for Mecab.  This takes long time...
reading /usr/share/mecab/dic/ipadic/unk.def ... 40
emitting double-array: 100% |###########################################| 
/usr/share/mecab/dic/ipadic/model.def is not found. skipped.
reading /usr/share/mecab/dic/ipadic/Others.csv ... 2
reading /usr/share/mecab/dic/ipadic/Prefix.csv ... 221
reading /usr/share/mecab/dic/ipadic/Suffix.csv ... 1393
reading /usr/share/mecab/dic/ipadic/Auxil.csv ... 199
reading /usr/share/mecab/dic/ipadic/Filler.csv ... 19
reading /usr/share/mecab/dic/ipadic/Noun.adjv.csv ... 3328
reading /usr/share/mecab/dic/ipadic/Conjunction.csv ... 171
reading /usr/share/mecab/dic/ipadic/Postp.csv ... 146
reading /usr/share/mecab/dic/ipadic/Noun.adverbal.csv ... 795
reading /usr/share/mecab/dic/ipadic/Noun.number.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.org.csv ... 16668
reading /usr/share/mecab/dic/ipadic/Adnominal.csv ... 135
reading /usr/share/mecab/dic/ipadic/Noun.others.csv ... 151
reading /usr/share/mecab/dic/ipadic/Noun.proper.csv ... 27327
reading /usr/share/mecab/dic/ipadic/Interjection.csv ... 252
reading /usr/share/mecab/dic/ipadic/Adverb.csv ... 3032
reading /usr/share/mecab/dic/ipadic/Noun.demonst.csv ... 120
reading /usr/share/mecab/dic/ipadic/Adj.csv ... 27210
reading /usr/share/mecab/dic/ipadic/Postp-col.csv ... 91
reading /usr/share/mecab/dic/ipadic/Noun.nai.csv ... 42
reading /usr/share/mecab/dic/ipadic/Symbol.csv ... 208
reading /usr/share/mecab/dic/ipadic/Verb.csv ... 130750
reading /usr/share/mecab/dic/ipadic/Noun.verbal.csv ... 12146
reading /usr/share/mecab/dic/ipadic/Noun.place.csv ... 72999
reading /usr/share/mecab/dic/ipadic/Noun.name.csv ... 34202
reading /usr/share/mecab/dic/ipadic/Noun.csv ... 60477
emitting double-array: 100% |###########################################| 
reading /usr/share/mecab/dic/ipadic/matrix.def ... 1316x1316
emitting matrix      : 100% |###########################################| 

done!
update-alternatives: using /var/lib/mecab/dic/ipadic-utf8 to provide /var/lib/mecab/dic/debian (mecab-dictionary) in auto mode
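
The last `update-alternatives` line in the log suggests that the ipadic-utf8 dictionary is now the default. One way to double-check is to ask mecab which charset its active dictionary reports, using the `-D` (`--dictionary-info`) option. A minimal sketch: `dict_charset` is a helper name made up here, and the check is skipped when mecab is not installed.

```shell
# dict_charset: read `mecab -D` style output on stdin and print the value
# of its "charset:" line (hypothetical helper for this sketch).
dict_charset() {
  awk -F':[ \t]*' '$1 == "charset" { print $2; exit }'
}

if command -v mecab >/dev/null 2>&1; then
  # -D prints information about the dictionary mecab would use.
  echo "dictionary charset: $(mecab -D | dict_charset)"
else
  echo "mecab not found; skipping dictionary check"
fi

# Show the locale of the current shell for comparison.
echo "locale: ${LC_ALL:-${LC_CTYPE:-${LANG:-unset}}}"
```

If the reported charset and the shell's locale encoding differ, garbled output like the earlier run is the likely result.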

OK. Let's try again.

$ mecab 
ああいう
ああ    感動詞,*,*,*,*,*,ああ,アア,アー
いう    動詞,自立,*,*,五段・ワ行促音便,基本形,いう,イウ,イウ
EOS

Good. The garbling is gone for now, and MeCab is usable.
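
In mecab's default output, each line holds the surface form, a tab, and then a comma-separated feature list, with `EOS` marking the end of a sentence. As a rough sketch of how that could be post-processed, the filter below keeps only the surface form and the first feature (the part of speech). `surface_pos` is an invented helper name, and the sample input is copied from the run above rather than produced by calling mecab.

```shell
# surface_pos: read mecab's default output on stdin and print
# "surface<TAB>part-of-speech" for every token line, skipping EOS.
surface_pos() {
  awk -F'\t' '$0 != "EOS" && NF >= 2 { split($2, feat, ","); print $1 "\t" feat[1] }'
}

# Sample input taken from the output shown above (tab-separated).
printf 'ああ\t感動詞,*,*,*,*,*,ああ,アア,アー\nいう\t動詞,自立,*,*,五段・ワ行促音便,基本形,いう,イウ,イウ\nEOS\n' | surface_pos
```

With mecab installed, piping real output through the same function (for example `echo 'ああいう' | mecab | surface_pos`) should work the same way.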

Summary

To use MeCab in a Cloud9 environment, install the following two packages.

$ sudo apt-get install mecab
$ sudo apt-get install mecab-ipadic-utf8
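
The two installs could also be combined into a single non-interactive command. This is a minimal sketch, assuming a Debian/Ubuntu system with `apt-get` and `sudo`. `install_mecab` and the `DRY_RUN` variable are names invented for this example, and by default the script only prints the command instead of running it.

```shell
#!/bin/sh
# Packages from the summary above.
PACKAGES="mecab mecab-ipadic-utf8"

# install_mecab: hypothetical helper for this sketch.
# With DRY_RUN=1 (the default here) it only prints the command it would run.
install_mecab() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "sudo apt-get install -y $PACKAGES"
  else
    # -y answers the "Do you want to continue? [Y/n]" prompt automatically.
    # $PACKAGES is intentionally unquoted so it splits into two arguments.
    sudo apt-get install -y $PACKAGES
  fi
}

install_mecab
```

Running it with `DRY_RUN=0` would perform the actual install.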

I haven't tuned the dictionary or anything around it yet, so the accuracy may be poor, but I'll adjust that later.

Next, I'd like to try driving this installed MeCab from Ruby.

Sites I referred to

さくら共有サーバー、UTF-8の辞書でmecabを使う方法 : nymemo

インストール/MeCabのインストール | GETAssoc

MeCabで文字コードの違う辞書を使う - Qiita

Rによるテキストマイニング入門
