pyeeg.io.WordLevelFeatures

class pyeeg.io.WordLevelFeatures(path_praat_env=None, path_surprisal=None, path_wordvectors=None, path_wordfrequency=None, path_syntactic=None, path_wordonsets=None, path_transcript=None, rnnlm_model=None, path_audio=None, keep_vectors=False, unk_wv='rdm')

Gather word-level linguistic features from legacy files in fixed formats, generated from various sources (RNNLM, custom scripts for word frequencies, forced-alignment toolkit, …).

Parameters:
  • path_praat_env (str) – Path to Praat file for envelopes (.Env files in Katerina_experiment/story-parts/alignment_data/)

  • path_surprisal (str) – Path to file containing surprisal values, as extracted with RNNLM toolkit.

  • path_wordvectors (str) – Path to word vector file, either binary or text. A text file must start with a line giving the number of words and the dimension (i.e. gensim-compatible).

  • path_wordfrequency (str) – Path to file containing word counts as extracted from Google Unigrams.

  • path_syntactic (str) – Path to file containing all following syntactic features: depth in syntactic structure parse, opening node, closing node

  • path_wordonsets (str) – Important! Path to file with word onsets. For now, this must be one of the *_timed.csv files, for instance as in Katerina_experiment/story-parts/wordfrequencies/*

  • path_transcript (str) – Path to actual text data of corresponding speech segment (transcript).

  • path_audio (str) – Path to audio file

  • rnnlm_model (str) – Path to RNNLM toolkit to compute surprisal and entropy values from a text.

  • keep_vectors (bool (default: False)) – If True, keep the full matrix of all word vectors in memory along with the vocabulary (see self.wordvectors_matrix).

  • unk_wv (str {'skip', 'rdm', 'closest'} (default: 'rdm')) – See the unk parameter of get_word_vectors.
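
As an illustration of the text format expected by path_wordvectors (a header line with the number of words and the dimension, then one word per line followed by its values), here is a minimal sketch of a reader. The function read_text_vectors and the sample data are hypothetical and are not part of pyeeg; the class may load vectors differently.

```python
import io
import numpy as np

def read_text_vectors(handle):
    """Parse word2vec-style text: header 'nwords ndim', then 'word v1 v2 ...' per line.

    Hypothetical helper for illustration only, not pyeeg's loader.
    """
    nwords, ndim = (int(x) for x in handle.readline().split())
    vocab = {}
    matrix = np.zeros((nwords, ndim))
    for line in handle:
        if not line.strip():
            continue  # ignore blank lines
        parts = line.rstrip().split(" ")
        row = len(vocab)          # next free row index
        vocab[parts[0]] = row     # word -> row in the matrix
        matrix[row] = [float(x) for x in parts[1:]]
    return vocab, matrix

# Made-up two-word, three-dimensional embedding
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\nafternoon 0.4 0.5 0.6\n")
vocab, matrix = read_text_vectors(sample)
```

Here vocab maps each word to its row in matrix, which mirrors the roles described for wordvectors_vocab and wordvectors_matrix below.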

duration

Duration in seconds of the speech segment represented by this instance

Type:

float

surprisal

Surprisal values for each word

Type:

array-like

wordfrequency

Word frequency values for each word

Type:

array-like

wordonsets

Onset values for each word

Type:

array-like

wordlist

List of words for which each feature is measured

Type:

list[str]

wordvectors

Matrix of word vectors, i.e. concatenation of the word vectors corresponding to words in the text

Type:

array-like (nwords x ndimension)

vectordim

Number of dimensions in the word embedding currently loaded.

Type:

int

wordvectors_matrix

All word vectors (only if keep_vectors is True)

Type:

ndarray (nvocab x ndims)

wordvectors_vocab

Vocabulary associated with word vectors (only if full matrix loaded)

Type:

dict

Examples

>>> from pyeeg.io import WordLevelFeatures
>>> import os
>>> from os.path import join
>>> wordfreq_path = '/media/hw2512/SeagateExpansionDrive/EEG_data/Katerina_experiment/story_parts/word_frequencies/'
>>> env_path = '/media/hw2512/SeagateExpansionDrive/EEG_data/Katerina_experiment/story_parts/alignement_data/'
>>> surprisal_path = '/media/hw2512/SeagateExpansionDrive/EEG_data/Katerina_experiment/story_parts/surprisal/'
>>> list_wordfreq_files = [item for item in os.listdir(wordfreq_path) if item.endswith('timed.csv')]
>>> list_surprisal_files = [item for item in os.listdir(surprisal_path) if item.endswith('3.txt')]
>>> list_stories = [item.replace('_word_freq_timed.csv', '') for item in list_wordfreq_files]  # str.strip would remove characters, not the suffix
>>> list_env_files = [os.path.join(env_path, s, s + '_125Hz.Env') for s in list_stories]
>>> surp_path = os.path.join(surprisal_path, list_surprisal_files[1])
>>> wf_path = os.path.join(wordfreq_path, list_wordfreq_files[1])
>>> dur_path = list_env_files[1]  # already a full path
>>> wo_path = os.path.join(wordfreq_path, list_wordfreq_files[1])
>>> wfeats = WordLevelFeatures(path_praat_env=dur_path, path_wordonsets=wo_path, path_surprisal=surp_path, path_wordfrequency=wf_path)
>>> wfeats
   surprisal   wordfreq      words     onsets
0   6.202895   9.631348         if   9.631348
1  11.555537   6.981747        the   6.981747
2  25.527839  13.521181  afternoon  13.521181
3  27.631021   8.300234        was   8.300234
4   9.720567  12.214926       fine  12.214926
Duration: 159.08
>>> feature_matrix = wfeats.align_word_features(srate=125, features=['surprisal', 'wordfrequency'])

Note

For now, the class assumes that the data are stored in a given format that depended on the processing done for the work on surprisal. However, it ought to be extended to be able to generate surprisal, word frequency, word onsets and word vectors for a given audio file along with its transcript.

Todo

  • Extend the class to create word features of choice on the fly

  • Be able to extract word onsets from TextGrid data?

Methods

WordLevelFeatures.align_word_features(srate)

Check that we have the correct number of elements for each feature. Will output a time series with a spike for each word feature at the times given by self.wordonsets.
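
The idea of a spike time series can be sketched with NumPy as follows. This is a minimal illustration, not the actual implementation of align_word_features: the function spike_series, its arguments and the rounding of onsets to the nearest sample are all assumptions.

```python
import numpy as np

def spike_series(onsets, values, srate, duration):
    """Return an (n_samples x n_features) array that is zero everywhere
    except at word onsets, where it holds that word's feature values.

    Hypothetical helper for illustration only, not part of pyeeg.
    """
    onsets = np.asarray(onsets, dtype=float)
    values = np.asarray(values, dtype=float)
    if values.ndim == 1:
        values = values[:, None]  # single feature -> one column
    if onsets.shape[0] != values.shape[0]:
        raise ValueError("Need one onset per word feature value")
    n_samples = int(np.ceil(duration * srate)) + 1
    idx = np.round(onsets * srate).astype(int)  # onset time (s) -> sample index
    if idx.size and idx.max() >= n_samples:
        raise ValueError("An onset falls after the end of the segment")
    series = np.zeros((n_samples, values.shape[1]))
    # If two onsets round to the same sample, the later word overwrites the earlier
    series[idx] = values
    return series

# Three made-up words at 0 s, 0.5 s and 1 s, sampled at 10 Hz
series = spike_series([0.0, 0.5, 1.0], [1.0, 2.0, 3.0], srate=10, duration=1.0)
```

The length check at the top corresponds to the "correct number of elements" requirement: every feature must supply exactly one value per word onset.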

WordLevelFeatures.summary()

Print a short summary of each variable contained in the instance.