I am working with a CountVectorizer from scikit-learn, and I may be trying to do something the object was not designed for, but I'm not sure.
Here is how I am counting occurrences:
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ['hi', 'bye', 'run away!']
corpus = ['run away!']
cv = CountVectorizer(vocabulary=vocabulary)
X = cv.fit_transform(corpus)
print(X.toarray())
which gives:
[[0 0 0]]
What I'm realizing is that the CountVectorizer breaks the corpus into what I believe are unigrams:
vocabulary = ['hi', 'bye', 'run']
corpus = ['run away!']
cv = CountVectorizer(vocabulary=vocabulary)
X = cv.fit_transform(corpus)
print(X.toarray())
which gives:
[[0 0 1]]
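To check that guess, here is a minimal sketch that asks the vectorizer for its analyzer and runs it on the document directly. It assumes the default settings (lowercasing, plus the default token pattern that keeps only runs of two or more word characters), so the space and the '!' should not survive tokenization:

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=['hi', 'bye', 'run'])

# build_analyzer() returns the callable that does preprocessing,
# tokenization and n-gram generation for this vectorizer.
analyze = cv.build_analyzer()
tokens = analyze('run away!')
print(tokens)  # ['run', 'away']
```

If that is right, a vocabulary entry such as 'run away!' can never match, because no single token contains a space or punctuation.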
Is there any way to tell the CountVectorizer exactly how you'd like it to vectorize the corpus? Ideally, I would like the multi-word entry 'run away!' from the first example to be counted as a single term.
In all honesty, however, I'm wondering if it is at all possible to get an outcome along these lines:
vocabulary = ['hi', 'bye', 'run away!']
corpus = ['I want to run away!']
cv = CountVectorizer(vocabulary=vocabulary)
X = cv.fit_transform(corpus)
print(X.toarray())
desired output:
[[0 0 1]]
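One direction that might get there, offered as a sketch and not a confirmed answer, is to let the vectorizer also build bigrams through its ngram_range parameter. This assumes I drop the trailing '!' from the vocabulary entry, since the default tokenizer discards punctuation:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: the vocabulary entry is written without '!', because the
# default token pattern discards punctuation before n-grams are built.
vocabulary = ['hi', 'bye', 'run away']
corpus = ['I want to run away!']

# ngram_range=(1, 2) makes the analyzer emit unigrams and bigrams,
# so the bigram 'run away' can be matched against the vocabulary.
cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
X = cv.fit_transform(corpus)
print(X.toarray())  # [[0 0 1]]
```

Keeping the '!' in the entry would presumably need a custom token_pattern or analyzer, which I have not tried.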
I don't see much information in the documentation for the fit_transform method, which only takes one argument as it is. If anyone has any ideas, I would be grateful. Thanks!