Basic feature extraction techniques for text.

Dummy dataset:

x_train = ["Sample one", "Sample one", "Sample one"]
x_train = [x.split() for x in x_train]
x_train
[['Sample', 'one'], ['Sample', 'one'], ['Sample', 'one']]
x_train = [["a", "b", "a"], ["a", "b"], ["c", "b"], ["d", "b"]]
x_test  = [["a", "e"], ["a"], ["c", "b", "b"], ["c"]]
y_train = ["class 1","class 1","class 2","class 3"]

class CountVectorizer

CountVectorizer(store_class_vocab=False)

Implementation of the bag-of-words model. Terms that do not occur in the fitted vocabulary contribute zero to the count matrix (they are effectively ignored at transform time).
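To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of the counting step. It is not this library's implementation: the helper names `fit_vocab` and `count_vectors` are hypothetical, and the sketch assumes the vocabulary is the sorted set of training tokens and that the output is a dense list of lists rather than a sparse matrix.

```python
from collections import Counter

def fit_vocab(docs):
    """Return the sorted tuple of distinct tokens seen in the tokenized docs."""
    return tuple(sorted({tok for doc in docs for tok in doc}))

def count_vectors(docs, vocab):
    """Map each tokenized doc to a row of per-term counts, one column per vocab term."""
    index = {term: i for i, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            if tok in index:  # tokens outside the vocabulary are skipped
                row[index[tok]] = n
        rows.append(row)
    return rows

# Small toy corpus (already tokenized), chosen for illustration only.
train = [["a", "b", "a"], ["a", "b"], ["c", "b"], ["d", "b"]]
test = [["a", "e"], ["a"], ["c", "b", "b"], ["c"]]

vocab = fit_vocab(train)
train_counts = count_vectors(train, vocab)
test_counts = count_vectors(test, vocab)

print(vocab)
print(train_counts)
print(test_counts)
```

The token "e" appears only in the test data, so it has no column and the first test row counts just the single "a".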

cv = CountVectorizer(store_class_vocab=True)
cv.fit(x_train, y_train)
cv.vocab
('a', 'b', 'c', 'd')
cv.store_class_vocab
{'class 1': defaultdict(int, {'a': 3, 'b': 2}),
 'class 2': defaultdict(int, {'c': 1, 'b': 1}),
 'class 3': defaultdict(int, {'d': 1, 'b': 1})}
x_train = cv.transform(x_train).tocsr() 
x_test = cv.transform(x_test).tocsr() 
x_train.toarray()
array([[2, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 1, 1, 0],
       [0, 1, 0, 1]], dtype=int64)
x_test.toarray()
array([[1, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 2, 1, 0],
       [0, 0, 1, 0]], dtype=int64)
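The per-class term counts exposed when `store_class_vocab=True` can be approximated with a short sketch like the one below. This is an assumption about what the option records (total occurrences of each term per class label), not the library's actual code, and `class_term_counts` is a hypothetical helper name.

```python
from collections import defaultdict

def class_term_counts(docs, labels):
    """Accumulate, for each class label, how often each token occurs in its docs."""
    counts = {}
    for doc, label in zip(docs, labels):
        per_class = counts.setdefault(label, defaultdict(int))
        for tok in doc:
            per_class[tok] += 1
    return counts

# Toy tokenized documents and their class labels, for illustration only.
docs = [["a", "b", "a"], ["a", "b"], ["c", "b"], ["d", "b"]]
labels = ["class 1", "class 1", "class 2", "class 3"]

per_class = class_term_counts(docs, labels)
for label, counts in per_class.items():
    print(label, dict(counts))
```

Keeping one counter per label makes it straightforward to derive class-conditional term frequencies later, for example for a Naive Bayes classifier.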