API Functions

Here are some of the API functions provided with this package:

GPT2 Tokenizer

PPLM.load_pretrained_tokenizer - Function
load_pretrained_tokenizer(ty::Type{T}; unk_token="<|endoftext|>", eos_token="<|endoftext|>", pad_token="<|endoftext|>") where T<:PretrainedTokenizer

Load the GPT2 tokenizer, using DataDeps for the pretrained BPE and vocab files. Returns the tokenizer as a GPT2Tokenizer structure.

source
load_pretrained_tokenizer(path_bpe, path_vocab, unk_token, eos_token, pad_token)

Load a pretrained GPT2 tokenizer from the provided BPE and vocab file paths. Initialises unk_token, eos_token, and pad_token as provided to the function. Returns the tokenizer as a GPT2Tokenizer structure.
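
Example

A minimal sketch of loading the default tokenizer; passing GPT2Tokenizer (the structure returned above) as ty is an assumption here:

using PPLM
tokenizer = load_pretrained_tokenizer(GPT2Tokenizer)   # BPE and vocab fetched via DataDeps on first use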

source
PPLM.tokenize - Function
tokenize(t::GPT2Tokenizer, text::AbstractString)

Function to tokenize the given text with the tokenizer's BPE encoder (t.bpe_encode). Returns a vector of string tokens.
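
Example

A small usage sketch, assuming a tokenizer loaded as in the previous example:

tokens = tokenize(tokenizer, "The weather is nice today")   # Vector{String} of BPE tokens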

source
PPLM.encode - Function
encode(t::GPT2Tokenizer, text::AbstractString; add_prefix_space=false)

Returns the encoded vector of tokens (the mapping from the tokenizer's vocab) for text. If add_prefix_space=true, a space is added at the start of text before tokenization.

Example

For single text:

encode(tokenizer, text)

For vector of text:

map(x->encode(tokenizer, x), text_vector) 
source
encode(t::GPT2Tokenizer, tokens::Vector{String})

Function to encode a vector of tokens to their integer mapping from the tokenizer's vocab.

source
PPLM.decode - Function
decode(vocab::Vocabulary{T}, is::Vector{Int}) where T

Return decoded vector of string tokens from the indices vector is, using the vocab.

source
decode(t::GPT2Tokenizer, tokens_ids::Vector{Int})

Return the decoded vector of string tokens from the indices vector tokens_ids, using the tokenizer t's encoder.
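
Example

A round-trip sketch using the tokenizer from the earlier examples:

ids    = encode(tokenizer, "Hello world")
tokens = decode(tokenizer, ids)   # back to the BPE string tokens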

source
PPLM.detokenize - Function
detokenize(t::GPT2Tokenizer, tokens::Vector{String})

BPE-decode the vector of string tokens, using the tokenizer t.

source
detokenize(t::GPT2Tokenizer, tokens_ids::Vector{Int})

Decode and detokenize the vector of indices tokens_ids. Returns the final sentence after detokenization.

Example

For single vector of token_ids:

detokenize(tokenizer, token_ids)

For vector of vector of token_ids, use:

map(x->detokenize(tokenizer, x), tokens_id_vector_of_vector)
source

Discriminator Model

General

PPLM.ClassifierHead - Type

struct ClassifierHead
    linear_layer::Dense
    embed_size::Int
    class_size::Int
end

Struct for ClassifierHead, defined with a single linear layer and two parameters: embed_size -> size of the embedding, class_size -> number of classes.

source
PPLM.get_discriminator - Function
get_discriminator(model; load_from_pretrained=false, discrim=nothing, file_name=nothing, version=2, class_size::Int=1, embed_size::Int=768, path=nothing)

Function to create a discriminator based on the provided model. If load_from_pretrained is set to true, the ClassifierHead layer is loaded from the pretrained models or from the path provided.
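
Example

A sketch of building a fresh 3-class discriminator head on top of GPT2; the (model, tokenizer) return order of get_gpt2 (see Utils below) is an assumption here:

model, tokenizer = get_gpt2()                              # assumed return order
discrim = get_discriminator(model; class_size=3, embed_size=768)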

source
PPLM.save_classifier_head - Function
save_classifier_head(cl_head; file_name=nothing, path=nothing, args=nothing, register_discrim=true, discrim_name="")

Function to save the ClassifierHead as a BSON file once training is complete, at the path provided. If path is nothing, the discriminator is saved in the ./pretrained_discriminators folder relative to the package directory.

source
PPLM.save_discriminator - Function
save_discriminator(discrim, discrim_name="Custom"; file_name=nothing, path=nothing, args=nothing)

Function to save the ClassifierHead part of the discriminator (by calling the save_classifier_head function), which is the only trainable part of the discriminator.
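
Example

A usage sketch; the discriminator name and file name here are hypothetical:

save_discriminator(discrim, "SentimentClf"; file_name="sentiment_clf.bson")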

source

Data Processing

PPLM.pad_seq - Function
pad_seq(batch::AbstractVector{T}, pad_token::Integer=0)

Function to add pad tokens to the shorter sequences, so that the length of each sequence equals the maximum length in the batch (calculated as max(map(length, batch))). The pad token defaults to 0.
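
Example

A toy sketch; the token ids are made up, and the exact container type of the padded result follows the implementation:

batch  = [[15496, 995], [40, 588, 1243, 257]]   # two sequences of unequal length
padded = pad_seq(batch)                         # shorter sequence padded with 0 up to length 4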

source
PPLM.get_mask - Function
get_mask(seq::AbstractMatrix{T}, pad_token::Integer=0, embed_size::Integer=768)

Function to create a padding mask for the sequences, informing the model that some part of a sequence is padding and should be ignored.

source
PPLM.data_preprocess - Function
data_preprocess(data_x, data_y, classification_type::String="Binary", num_classes::Integer=2; args=nothing)

Function to preprocess data_x and data_y, along with creating a mask for data_x.

Preprocessing for data_x consists of padding with the pad token (expected to be provided as args.pad_token).

Preprocessing for data_y consists of creating a onehotbatch over 1:num_classes (if classification_type is not "Binary"); otherwise the data is reshaped to (1, length(data_y)).

Returns data_x, data_y, mask after pre-processing.

source
PPLM.load_data - Function
load_data(data_x, data_y, tokenizer::PretrainedTokenizer; batchsize::Integer=8, truncate::Bool=false, max_length::Integer=256, shuffle::Bool=false, drop_last::Bool=false, add_eos_start::Bool=true)

Returns a DataLoader for data_x and data_y after processing data_x, with batch size batchsize. The processing consists of tokenization of data_x and truncation to max_length if truncate is set to true.

If add_eos_start is set to true, the tokenizer's EOS token is added at the start.
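
Example

A usage sketch with hypothetical text and label vectors:

loader = load_data(text, labels, tokenizer; batchsize=8, truncate=true, max_length=128)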

source
PPLM.load_cached_data - Function
load_cached_data(discrim::Union{DiscriminatorV1, DiscriminatorV2}, data_x, data_y, tokenizer::PretrainedTokenizer; truncate::Bool=false, max_length::Integer=256, shuffle::Bool=false, batchsize::Int=4, drop_last::Bool=false, classification_type="Binary", num_classes=2, args=nothing)

Returns a DataLoader with (x, y) pairs that can be fed directly into the classifier layer for training.

The function first loads the data using the load_data function with batchsize=1, then passes each batch through the transformer model of discrim after data preprocessing. The average representations of the hidden states are stored in a vector, which is then loaded into a DataLoader, ready to use for classifier training.

Note: This function saves time by caching the average representation of the hidden states beforehand, avoiding a pass of the data through the model in every epoch of training. This is possible because the model itself is non-trainable while training the discriminator's classifier head.

source
PPLM.load_data_from_csv - Function
load_data_from_csv(path_to_csv; text_col="text", label_col="label", delim=',', header=1)

Load data from a CSV file, using the specified text_col column for the text and the label_col column for the target label. Returns vectors of text and labels.
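
Example

A usage sketch; the file path and column names here are hypothetical:

text, labels = load_data_from_csv("data/reviews.csv"; text_col="review", label_col="sentiment")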

source

Training

PPLM.train! - Function
train!(discrim, data_loader; args=args)

Train the discriminator on the training data provided by data_loader, using the arguments args.

source
PPLM.test! - Function
test!(discrim, data_loader; args=nothing)

Test the discriminator on the test data provided by data_loader, reporting accuracy and NLL loss.
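
Example

A sketch of a manual training loop; discrim, train_loader, test_loader and args are assumed to have been built with the functions above:

for epoch in 1:5
    train!(discrim, train_loader; args=args)   # one pass over the training data
    test!(discrim, test_loader; args=args)     # report accuracy and NLL loss
end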

source
PPLM.train_discriminator - Function
train_discriminator(text, labels, batchsize::Int=8, classification_type::String="Binary", num_classes::Int=2; model="gpt2", cached::Bool=true, discrim=nothing, tokenizer=nothing, truncate=true, max_length=256, train_size::Float64=0.9, lr::Float64=1e-5, epochs::Int=10, args=nothing)

Function to train a discriminator on the provided text and target labels, based on the set of function parameters provided. Returns the discriminator discrim after training.

Here cached=true enables caching of the contextualized embeddings (the forward pass) of the GPT2 model, as the model itself is non-trainable. This effectively reduces training time, as the forward pass through the GPT2 model needs to be done only once.

Example

Consider a multiclass classification problem with a class size of 5; it can be trained on text and labels vectors using:

train_discriminator(text, labels, 16, "Multiclass", 5)
source

Bag of Words

PPLM.get_bow_indices - Function
get_bow_indices(bow_key_or_path_list::Vector{String}, tokenizer)

Returns a list of lists of word indices, one for each Bag of Words in bow_key_or_path_list, after tokenization. The function looks up each provided BoW key in the registered artifacts of the Artifacts.toml file. If a key is not present there, the function expects it to be the complete path to the file or a URL from which the .txt file can be downloaded.

Example

get_bow_indices(["legal", "military"])
source
PPLM.build_bow_ohe - Function
build_bow_ohe(bow_indices, tokenizer)

Build and return a list of one-hot matrices, one for each Bag of Words list of indices. Each item of the list has dimensions (num_of_BoW_words, tokenizer.vocab_size).

Note: While building the OHE of word indices, only words that have length 1 after tokenization are kept; the rest are discarded.
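
Example

A sketch chaining the two functions above for a single word list, with the tokenizer from the earlier examples:

indices  = get_bow_indices(["legal"], tokenizer)
ohe_list = build_bow_ohe(indices, tokenizer)   # one matrix of size (num_of_BoW_words, vocab_size)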

source

Generation

Normal

PPLM.sample_normal - Function
sample_normal(; prompt="I hate the customs", tokenizer=nothing, model=nothing, max_length=100, method="top_k", k=50, t=1.2, p=0.5, add_eos_start=true)

Function to generate normal (unperturbed) sentences with the model and tokenizer provided. If these are not provided, the function itself creates an instance of the GPT2-small tokenizer and LM head model. The sentences start from the provided prompt and are generated until the token length reaches max_length.

Two sampling methods of generation are provided with this function:

  1. method="top_k"
  2. method="nucleus"

Either method can be used, supplying k for top-k sampling or p for nucleus sampling.
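
Example

A sketch of nucleus-sampled generation, letting the function load the GPT2-small model and tokenizer itself; the prompt is hypothetical:

sample_normal(; prompt="The food was", method="nucleus", p=0.9, max_length=60)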

source

PPLM

PPLM.sample_pplm - Function
sample_pplm(pplm_args; tokenizer=nothing, model=nothing, prompt="I hate the customs", sample_method="top_k", add_eos_start=true)

Function for PPLM-based generation. Generates a perturbed sentence using pplm_args, tokenizer, and model (GPT2, if not provided), starting with prompt. The generation is controlled by the arguments/parameters provided in pplm_args, which is an instance of the pplm struct.
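
Example

A usage sketch; args is assumed to be a previously constructed instance of the pplm struct (its fields are not documented in this section):

sample_pplm(args; prompt="The potato", sample_method="top_k")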

source
PPLM.perturb_probs - Function
perturb_probs(probs, tokenizer, args)

Perturb the probabilities probs based on the provided Bag of Words list (as given with args). This function is supported only for the BoW model.

source
PPLM.perturb_hidden_bow - Function
perturb_hidden_bow(hidden, model, tokenizer, args)

Perturb the hidden states hidden based on the provided Bag of Words list (as given with args). The perturbation is primarily based on the gradient of a loss evaluated over the desired Bag of Words and the KL divergence from the original token distribution.

See also perturb_hidden_discrim.

source
PPLM.perturb_past_bow - Function
perturb_past_bow(model, prev, past, original_probs, args)

Perturb the past key values prev based on the provided Bag of Words list (as given with args). The perturbation is primarily based on the gradient of a loss evaluated over the desired Bag of Words and the KL divergence from the original token distribution.

See also perturb_past_discrim.

source
PPLM.perturb_hidden_discrim - Function
perturb_hidden_discrim(hidden, model, tokenizer, args)

Perturb the hidden states hidden based on the provided discriminator (as given with args). The perturbation is primarily based on the gradient of a loss evaluated over the desired discriminator attribute and the KL divergence from the original token distribution.

See also perturb_hidden_bow.

source
PPLM.perturb_past_discrim - Function
perturb_past_discrim(model, prev, past, original_probs, args)

Perturb the past key values prev based on the provided discriminator (as given with args). The perturbation is primarily based on the gradient of a loss evaluated over the desired discriminator attribute and the KL divergence from the original token distribution.

See also perturb_past_bow.

source

Utils

PPLM.get_gpt2 - Function
get_gpt2()

Function to load the GPT2 LM head model along with the tokenizer.

source
PPLM.get_gpt2_medium - Function
get_gpt2_medium()

Function to load the GPT2-medium LM head model along with the tokenizer.

Note: If this function gives a permission-denied error, try changing the file permissions of the Artifacts.toml file of the Transformers.jl package (it is read-only by default), located under the src/huggingface folder.

source
PPLM.set_device - Function
set_device(d_id=0)

Function to set the CUDA device, if available, and to disallow scalar operations.

source
PPLM.get_artifact - Function
get_artifact(name)

Utility function to download/install the artifact if it is not already installed.

source
PPLM.register_custom_file - Function
register_custom_file(artifact_name, file_name, path)

Function to register a custom file under artifact_name in Artifacts.toml. path expects the path of the directory where the file file_name is stored. The complete path to the file is stored as the artifact URL.

Example

register_custom_file("custom", "xyz.txt", "./folder/folder/")

Note: If this gives a permission-denied error, change the Artifacts.toml file permissions using chmod(path_to_file_in_julia_installation, 0o764) or similar.

source
PPLM.top_k_sample - Function
top_k_sample(probs; k=10)

Sampling function that returns an index from the top-k probabilities, based on the provided k. The function removes all tokens with a probability less than that of the last token of the top-k before sampling.

source
PPLM.nucleus_sample - Function
nucleus_sample(probs; p=0.8)

Nucleus sampling function: returns an index sampled from the reverse-sorted probabilities probs, up to the index where the cumulative probability remains less than the provided p. Tokens with cumulative probability above the threshold p are removed before sampling.
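
Example

A toy sketch with a small hand-made probability vector:

probs = [0.4, 0.3, 0.2, 0.05, 0.05]
top_k_sample(probs; k=3)       # index sampled from the 3 most likely entries
nucleus_sample(probs; p=0.8)   # index sampled from the entries whose cumulative probability stays below 0.8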

source
PPLM.binary_accuracy - Function
binary_accuracy(y_pred, y_true; threshold=0.5)

Calculates the averaged binary accuracy from y_pred and y_true. The threshold argument specifies the minimum predicted probability in y_pred required for a prediction to be labelled 1; it defaults to 0.5.
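
Example

A small worked sketch; with the default threshold of 0.5 the predictions become [1, 0, 1], so two out of three match y_true:

y_pred = [0.9, 0.2, 0.7]
y_true = [1, 0, 0]
binary_accuracy(y_pred, y_true)   # about 0.67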

source