Converting word-frequency data into SVMlight-format training data

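For tasks such as document classification, you typically use a library like SVMlight or LIBSVM.
These libraries expect each feature to be identified by an integer ID rather than by a name,
so I wrote a script, converter.py, that does the conversion.
(There is also a library called Classias that accepts arbitrary strings as feature names.)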

To convert feature names to numeric IDs:

python converter.py -d sample.txt decoded.txt taiou

To convert numeric IDs back to feature names:

python converter.py -e decoded.txt sample.txt taiou

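Here sample.txt is the original data, decoded.txt is the data with feature names converted to numeric IDs, and taiou (Japanese for "correspondence") is the table that maps each ID to its feature name.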

For example, take the following data:

+1 この:0.5 スープ:0.5 こく:0.5 ある:0.5
-1 この:0.447 スープ:0.447 調味料:0.447 入る:0.447 いる:0.447

Converting its feature names to numeric IDs gives:

+1 1:0.500000 2:0.500000 3:0.500000 4:0.500000
-1 1:0.447000 2:0.447000 5:0.447000 6:0.447000 7:0.447000

The mapping table in this case is:

1:この
2:スープ
3:こく
4:ある
5:調味料
6:入る
7:いる
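The same name-to-ID conversion can be sketched in memory, without any files. This is only an illustration, assuming Python 3.7+ (where dicts keep insertion order); the function name `names_to_ids` is hypothetical and is not part of converter.py.

```python
def names_to_ids(lines):
    """Return (converted_lines, table_lines) for 'label name:weight ...' input."""
    ids = {}  # feature name -> 1-based ID, assigned in order of first appearance
    converted = []
    for line in lines:
        label, *features = line.split()
        pairs = []
        for feature in features:
            # Split on the last colon so the weight is always the final field
            name, weight = feature.rsplit(":", 1)
            # Reuse the existing ID, or hand out the next unused one
            idno = ids.setdefault(name, len(ids) + 1)
            pairs.append("%d:%f" % (idno, float(weight)))
        converted.append(" ".join([label] + pairs))
    table = ["%d:%s" % (idno, name) for name, idno in ids.items()]
    return converted, table


sample = [
    "+1 この:0.5 スープ:0.5 こく:0.5 ある:0.5",
    "-1 この:0.447 スープ:0.447 調味料:0.447 入る:0.447 いる:0.447",
]
converted, table = names_to_ids(sample)
print("\n".join(converted))
print("\n".join(table))
```

IDs are handed out in order of first appearance, which is why この and スープ keep IDs 1 and 2 on the second line while the new words start at 5.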

converter.py

# -*- coding: utf-8 -*-
from optparse import OptionParser
import sys

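def decode(src, dst, table):
    """Convert feature names to feature IDs and write SVMlight training data."""
    words = []     # feature names in order of first appearance
    word_ids = {}  # feature name -> 1-based feature ID (fast lookup)
    # Write the training data in SVMlight format
    with open(src) as fr:
        with open(dst, "w") as fw:
            for line in fr:
                features = line.split()
                if not features:
                    continue  # skip blank lines
                label = features.pop(0)
                pairs = []
                for feature in features:
                    # Split on the last colon so names containing ":" survive
                    (word, weight) = feature.rsplit(":", 1)
                    if word not in word_ids:
                        words.append(word)
                        word_ids[word] = len(words)
                    pairs.append((word_ids[word], float(weight)))
                # SVMlight expects feature IDs in increasing order on each line
                pairs.sort()
                fw.write(label)
                for (idno, weight) in pairs:
                    fw.write(" %d:%f" % (idno, weight))
                fw.write("\n")

    # Write the ID-to-word mapping table
    with open(table, "w") as fw:
        for (idno, word) in enumerate(words, 1):
            fw.write("%d:%s\n" % (idno, word))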
        
def encode(src, dst, table):
    """Convert feature IDs back to feature names."""
    # Read the ID-to-word mapping table
    words = []
    with open(table) as fr:
        count = 1
        for line in fr:
            line = line.strip()
            # Split on the first colon only; the name itself may contain ":"
            (idno, word) = line.split(":", 1)
            idno = int(idno)
            # IDs must run 1, 2, 3, ... with no gaps
            if count != idno:
                sys.exit("encode error: unexpected ID %d in %s" % (idno, table))
            words.append(word)
            count += 1

    # Convert the SVMlight training data
    with open(src) as fr:
        with open(dst, "w") as fw:
            for line in fr:
                line = line.strip()
                features = line.split(" ")
                label = features.pop(0)
                fw.write(label)
                for feature in features:
                    (idno, weight) = feature.split(":")
                    idno = int(idno)
                    word = words[idno-1]
                    weight = float(weight)
                    fw.write(" %s:%f" % (word, weight))
                fw.write("\n")

def main():
    # Parse the command-line options
    usage = "Usage: python %prog (-e|-d) src dst table"
    parser = OptionParser(usage)
    parser.set_defaults(method=0)
    parser.add_option("-d", "--decode",
                      action="store_const", const=1, dest="method",
                      help="convert word to id")
    parser.add_option("-e", "--encode",
                      action="store_const", const=2, dest="method",
                      help="convert id to word")
    (options, args) = parser.parse_args()
    if len(args) != 3:
        # parser.error() prints the usage with %prog expanded, then exits
        parser.error("expected three arguments: src dst table")

    # Convert the data
    if options.method == 1:
        decode(args[0], args[1], args[2])
    elif options.method == 2:
        encode(args[0], args[1], args[2])
    else:
        parser.error("specify either -d or -e")

if __name__ == "__main__":
    main()
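
As a quick way to see what the ID-to-name direction does, here is a small in-memory sketch, assuming Python 3. The function name `ids_to_names` is hypothetical and is not part of converter.py; it only mirrors the idea of looking each ID up in the mapping table.

```python
def ids_to_names(lines, table_lines):
    """Replace each numeric feature ID with its name from the mapping table."""
    names = {}
    for row in table_lines:
        # Each table row is 'ID:name'; split on the first colon only
        idno, name = row.split(":", 1)
        names[int(idno)] = name

    restored = []
    for line in lines:
        label, *features = line.split()
        pairs = []
        for feature in features:
            idno, weight = feature.split(":", 1)
            pairs.append("%s:%f" % (names[int(idno)], float(weight)))
        restored.append(" ".join([label] + pairs))
    return restored


table = ["1:この", "2:スープ", "3:こく", "4:ある"]
restored = ids_to_names(["+1 1:0.500000 2:0.500000 3:0.500000 4:0.500000"], table)
print(restored[0])
# -> +1 この:0.500000 スープ:0.500000 こく:0.500000 ある:0.500000
```

Because the weights are written with %f, a round trip restores the feature names but not the original text of the weights: 0.5 comes back as 0.500000.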

tanihito’s blog
