pyeeg.io.WordLevelFeatures
- class pyeeg.io.WordLevelFeatures(path_praat_env=None, path_surprisal=None, path_wordvectors=None, path_wordfrequency=None, path_syntactic=None, path_wordonsets=None, path_transcript=None, rnnlm_model=None, path_audio=None, keep_vectors=False, unk_wv='rdm')
Gather word-level linguistic features from rigidly formatted legacy files, generated by various sources (RNNLM, custom scripts for word frequencies, a forced-alignment toolkit, …)
- Parameters:
path_praat_env (str) – Path to Praat file for envelopes (.Env files in Katerina_experiment/story-parts/alignment_data/)
path_surprisal (str) – Path to file containing surprisal values, as extracted with RNNLM toolkit.
path_wordvectors (str) – Path to word vector file; can be binary or text. A text file must start with a line giving the number of words and the embedding dimension (i.e. gensim-compatible)
path_wordfrequency (str) – Path to file containing word counts as extracted from Google Unigrams.
path_syntactic (str) – Path to file containing all of the following syntactic features: depth in the syntactic parse tree, opening node, closing node
path_wordonsets (str) – Important! Path to file with word onsets. For now, this must be one of the *_timed.csv files, for instance as in Katerina_experiment/story-parts/wordfrequencies/*
path_transcript (str) – Path to actual text data of corresponding speech segment (transcript).
path_audio (str) – Path to audio file
rnnlm_model (str) – Path to RNNLM toolkit to compute surprisal and entropy values from a text.
keep_vectors (bool (default: False)) – If True, will keep the full matrix of all word vectors in memory along with the vocabulary (see self.wordvectors_matrix)
unk_wv (str {'skip', 'rdm', 'closest'}) – See the unk parameter in get_word_vectors()
- duration
Duration in seconds of the speech segment represented by this instance
- Type:
float
- surprisal
Surprisal values for each word
- Type:
array-like
- wordfrequency
Word frequency values for each word
- Type:
array-like
- wordonsets
Onset values for each word
- Type:
array-like
- wordlist
List of words for which each feature is measured
- Type:
list[str]
- wordvectors
Matrix of word vectors, i.e. concatenation of the word vectors corresponding to the words in the text
- Type:
array-like (nwords x ndimension)
- vectordim
Number of dimensions in the word embedding currently loaded.
- Type:
int
- wordvectors_matrix
All word vectors (only if keep_vectors is True)
- Type:
ndarray (nvocab x ndims)
- wordvectors_vocab
Vocabulary associated with word vectors (only if full matrix loaded)
- Type:
dict
Examples
>>> from pyeeg.io import WordLevelFeatures
>>> import os
>>> from os.path import join
>>> wordfreq_path = '/media/hw2512/SeagateExpansionDrive/EEG_data/Katerina_experiment/story_parts/word_frequencies/'
>>> env_path = '/media/hw2512/SeagateExpansionDrive/EEG_data/Katerina_experiment/story_parts/alignement_data/'
>>> surprisal_path = '/media/hw2512/SeagateExpansionDrive/EEG_data/Katerina_experiment/story_parts/surprisal/'
>>> list_wordfreq_files = [item for item in os.listdir(wordfreq_path) if item.endswith('timed.csv')]
>>> list_surprisal_files = [item for item in os.listdir(surprisal_path) if item.endswith('3.txt')]
>>> list_stories = [item.replace('_word_freq_timed.csv', '') for item in list_wordfreq_files]
>>> list_env_files = [os.path.join(env_path, s, s + '_125Hz.Env') for s in list_stories]
>>> surp_path = os.path.join(surprisal_path, list_surprisal_files[1])
>>> wf_path = os.path.join(wordfreq_path, list_wordfreq_files[1])
>>> dur_path = list_env_files[1]
>>> wo_path = os.path.join(wordfreq_path, list_wordfreq_files[1])
>>> wfeats = WordLevelFeatures(path_praat_env=dur_path, path_wordonsets=wo_path, path_surprisal=surp_path, path_wordfrequency=wf_path)
>>> wfeats
   surprisal   wordfreq      words     onsets
0   6.202895   9.631348         if   9.631348
1  11.555537   6.981747        the   6.981747
2  25.527839  13.521181  afternoon  13.521181
3  27.631021   8.300234        was   8.300234
4   9.720567  12.214926       fine  12.214926
Duration: 159.08
>>> feature_matrix = wfeats.align_word_features(srate=125, features=['surprisal', 'wordfrequency'])
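The word-embedding options are not exercised above. The following is a hedged sketch, reusing wo_path from the example; the embedding path is a placeholder, and the attribute shapes assume NumPy arrays as documented. It illustrates how path_wordvectors, keep_vectors and unk_wv might be combined:
>>> # Placeholder path to a gensim-compatible text embedding file (assumption).
>>> wv_path = '/path/to/embeddings/wordvectors_gensim.txt'
>>> wfeats_wv = WordLevelFeatures(path_wordonsets=wo_path,
...                               path_wordvectors=wv_path,
...                               keep_vectors=True,  # also keep full matrix + vocabulary
...                               unk_wv='rdm')       # presumably a random vector for unknown words
>>> wfeats_wv.vectordim                 # embedding dimensionality
>>> wfeats_wv.wordvectors.shape         # (nwords, vectordim)
>>> wfeats_wv.wordvectors_matrix.shape  # (nvocab, ndims), available because keep_vectors=True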
Note
For now, the class assumes that the data are stored in a given format that depended on the processing done for the work on surprisal. However, it ought to be extended to be able to generate surprisal/word frequency/word onsets/word vectors for a given audio file along with its transcript.
Todo
Extend the class to create word features of choice on-the-fly
Be able to extract word onsets from TextGrid data? (a starting-point sketch follows below)
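As a rough starting point for the TextGrid item above, here is a minimal sketch, not part of the current API, that pulls word onsets out of a long-format Praat TextGrid using only the standard library. The function name word_onsets_from_textgrid is hypothetical; dedicated parsers such as the textgrid or praatio packages would be more robust.
import re

def word_onsets_from_textgrid(path):
    """Hypothetical helper: return (words, onsets) from a long-format TextGrid.

    Assumes a single word-level IntervalTier and skips empty intervals
    (silences). A sketch only, not the class's own implementation.
    """
    with open(path, encoding='utf-8') as fid:
        content = fid.read()
    # Each interval in a long-format TextGrid lists xmin, xmax, then text.
    interval = re.compile(r'xmin = ([\d.]+)\s*\n\s*xmax = [\d.]+\s*\n\s*text = "(.*)"')
    words, onsets = [], []
    for onset, text in interval.findall(content):
        if text.strip():  # keep only actual words, drop silences
            words.append(text.strip())
            onsets.append(float(onset))
    return words, onsets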
Methods
- Check that we have the correct number of elements for each feature.
- Output a time series with a spike for each word feature at the times given by self.wordonsets (align_word_features, as used in the example above; see also the usage sketch below).
- Print a short summary of each variable contained in the instance.
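For instance, continuing the Examples section above, the aligned output can be checked against the segment duration. This is a hedged sketch assuming the returned object is a NumPy array; the exact layout (one column per requested feature, spikes at word onsets) may differ from this assumption.
>>> import numpy as np
>>> X = wfeats.align_word_features(srate=125, features=['surprisal', 'wordfrequency'])
>>> # Spikes are placed at the word onsets; the number of rows should
>>> # roughly match duration * srate (exact layout may differ).
>>> X.shape[0], int(np.ceil(wfeats.duration * 125))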