API Functions
Here are some of the API functions provided with this package:
GPT2 Tokenizer
PPLM.load_pretrained_tokenizer
— Function load_pretrained_tokenizer(ty::Type{T}; unk_token="<|endoftext|>", eos_token="<|endoftext|>", pad_token="<|endoftext|>") where T<:PretrainedTokenizer
Load the GPT2 tokenizer using DataDeps for the pretrained BPE and vocab files. Returns the tokenizer as a GPT2Tokenizer structure.
load_pretrained_tokenizer(path_bpe, path_vocab, unk_token, eos_token, pad_token)
Load a pretrained GPT2 tokenizer from the provided BPE and vocab file paths. Initialises unk_token, eos_token and pad_token as provided to the function. Returns the tokenizer as a GPT2Tokenizer structure.
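A minimal usage sketch, assuming the function and the GPT2Tokenizer type are reachable from the PPLM namespace:
using PPLM
tokenizer = load_pretrained_tokenizer(PPLM.GPT2Tokenizer)   # fetches BPE/vocab via DataDeps on first use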
PPLM.tokenize
— Function tokenize(t::GPT2Tokenizer, text::AbstractString)
Tokenize the given text with the tokenizer's BPE encoder (t.bpe_encode). Returns a string vector of tokens.
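A short call sketch (the exact token strings depend on the BPE merges):
tokens = tokenize(tokenizer, "Hello world")   # Vector{String} of BPE tokens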
PPLM.encode
— Function encode(t::GPT2Tokenizer, text::AbstractString; add_prefix_space=false)
Returns the encoded vector of tokens (mapped through the tokenizer's vocab) for text. If add_prefix_space=true, a space is added at the start of text before tokenization.
Example
For single text:
encode(tokenizer, text)
For vector of text:
map(x->encode(tokenizer, x), text_vector)
encode(t::GPT2Tokenizer, tokens::Vector{String})
Encode a vector of tokens to their integer mapping from the tokenizer's vocab.
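A sketch of chaining tokenization and encoding (the returned indices depend on the vocab):
tokens = tokenize(tokenizer, "Hello world")
ids    = encode(tokenizer, tokens)            # Vector{Int} of vocab indices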
PPLM.decode
— Function decode(vocab::Vocabulary{T}, is::Vector{Int}) where T
Return the decoded vector of string tokens for the indices vector is, using the vocab.
decode(t::GPT2Tokenizer, tokens_ids::Vector{Int})
Return the decoded vector of string tokens for the indices vector tokens_ids, using the tokenizer t's encoder.
PPLM.detokenize
— Function detokenize(t::GPT2Tokenizer, tokens::Vector{String})
BPE-decode the vector of strings, using the tokenizer t.
detokenize(t::GPT2Tokenizer, tokens_ids::Vector{Int})
Decode and detokenize the vector of indices tokens_ids. Returns the final sentence after detokenization.
Example
For a single vector of token ids:
detokenize(tokenizer, token_ids)
For a vector of vectors of token ids, use:
map(x->detokenize(tokenizer, x), tokens_id_vector_of_vector)
Discriminator Model
General
PPLM.ClassifierHead
— Type
struct ClassifierHead
    linearlayer::Dense
    embedsize::Int
    class_size::Int
end
Struct for ClassifierHead, defined with a single linear layer and two parameters: embedsize, the size of the embedding, and class_size, the number of classes.
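A minimal construction sketch, assuming the default field-order constructor and Flux's Dense layer:
using Flux
head = ClassifierHead(Dense(768, 2), 768, 2)   # 768-dim embeddings, 2 classes (constructor order assumed)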
PPLM.get_discriminator
— Function get_discriminator(model; load_from_pretrained=false, discrim=nothing, file_name=nothing, version=2, class_size::Int=1, embed_size::Int=768, path=nothing)
Create a discriminator based on the provided model. If load_from_pretrained is true, the ClassifierHead layer is loaded from the pretrained models or from the path provided.
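A usage sketch, assuming get_gpt2() from the Utils section returns the model together with its tokenizer (return order assumed):
model, tokenizer = get_gpt2()
discrim = get_discriminator(model; class_size=2, embed_size=768)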
PPLM.save_classifier_head
— Function save_classifier_head(cl_head; file_name=nothing, path=nothing, args=nothing, register_discrim=true, discrim_name="")
Save the ClassifierHead as a BSON file once training is complete, at the path provided. If path is nothing, the classifier head is saved in the ./pretrained_discriminators folder relative to the package directory.
PPLM.save_discriminator
— Function save_discriminator(discrim, discrim_name="Custom"; file_name=nothing, path=nothing, args=nothing)
Save the ClassifierHead part of the discriminator (by calling the save_classifier_head function), which is the only trainable part of the discriminator.
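A minimal call sketch (the discriminator name here is hypothetical):
save_discriminator(discrim, "my_sentiment")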
Data Processing
PPLM.pad_seq
— Function pad_seq(batch::AbstractVector{T}, pad_token::Integer=0)
Add pad tokens to the shorter sequences, so that every sequence in the batch has length max_length (calculated as max(map(length, batch))). The pad token defaults to 0.
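A small sketch of the intended behaviour (the return container is assumed to mirror the input batch):
batch  = [[12, 43, 7], [5, 9]]
padded = pad_seq(batch)        # the shorter sequence is padded with 0 up to length 3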
PPLM.get_mask
— Function get_mask(seq::AbstractMatrix{T}, pad_token::Integer=0, embed_size::Integer=768)
Create a mask for the sequences against padding, so that the model knows which part of a sequence is padding and should be ignored.
PPLM.data_preprocess
— Function data_preprocess(data_x, data_y, classification_type::String="Binary", num_classes::Integer=2; args=nothing)
Preprocess data_x and data_y and create a mask for data_x.
Preprocessing for data_x consists of padding with the pad token (expected to be provided as args.pad_token).
Preprocessing for data_y consists of creating a onehotbatch for data_y over 1:num_classes (if classification_type is not "Binary"); otherwise the data is reshaped as (1, length(data_y)).
Returns data_x, data_y and mask after preprocessing.
PPLM.load_data
— Function load_data(data_x, data_y, tokenizer::PretrainedTokenizer; batchsize::Integer=8, truncate::Bool=false, max_length::Integer=256, shuffle::Bool=false, drop_last::Bool=false, add_eos_start::Bool=true)
Returns a DataLoader for data_x and data_y after processing data_x, with batchsize=batchsize. The processing consists of tokenizing data_x and, if truncate is true, truncating it to max_length.
If add_eos_start is true, the tokenizer's EOS token is added at the start.
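A usage sketch with hypothetical text and label vectors:
text   = ["the film was great", "the plot made no sense"]
labels = [1, 0]
loader = load_data(text, labels, tokenizer; batchsize=2, truncate=true, max_length=64)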
PPLM.load_cached_data
— Function load_cached_data(discrim::Union{DiscriminatorV1, DiscriminatorV2}, data_x, data_y, tokenizer::PretrainedTokenizer; truncate::Bool=false, max_length::Integer=256, shuffle::Bool=false, batchsize::Int=4, drop_last::Bool=false, classification_type="Binary", num_classes=2, args=nothing)
Returns a DataLoader with (x, y) pairs that can be fed directly into the classifier layer for training.
The function first loads the data using the load_data function with batchsize=1, then passes each batch through the transformer model of discrim after data preprocessing; the average representation of the hidden_states is stored in a vector, which is then loaded into a DataLoader ready for classification training.
Note: This function saves time by caching the average representation of the hidden states beforehand, avoiding a pass through the model in every training epoch. This is possible because the model itself is non-trainable while training the discriminator classifier head.
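A sketch mirroring the load_data example above, with the same hypothetical data:
cached_loader = load_cached_data(discrim, text, labels, tokenizer; batchsize=4, truncate=true, max_length=64)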
PPLM.load_data_from_csv
— Function load_data_from_csv(path_to_csv; text_col="text", label_col="label", delim=',', header=1)
Load data from a CSV file, using the text_col column for the text and the label_col column for the target label. Returns the text and label vectors.
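A call sketch with a hypothetical file and column names:
text, labels = load_data_from_csv("data/reviews.csv"; text_col="review", label_col="sentiment")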
Training
PPLM.train!
— Function train!(discrim, data_loader; args=args)
Train the discriminator on the training data provided by data_loader, using the arguments args.
PPLM.test!
— Function test!(discrim, data_loader; args=nothing)
Test the discriminator on the test data provided by data_loader, reporting accuracy and NLL loss.
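A short train/evaluate sketch, assuming train_loader and test_loader come from the data-processing functions above and args is a pre-built arguments object (all names hypothetical):
train!(discrim, train_loader; args=args)
test!(discrim, test_loader; args=args)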
PPLM.train_discriminator
— Function train_discriminator(text, labels, batchsize::Int=8, classification_type::String="Binary", num_classes::Int=2; model="gpt2", cached::Bool=true, discrim=nothing, tokenizer=nothing, truncate=true, max_length=256, train_size::Float64=0.9, lr::Float64=1e-5, epochs::Int=10, args=nothing)
Train a discriminator on the provided text and target labels, based on the function parameters given. Returns the discriminator discrim after training.
Setting cached=true enables caching of the contextualized embeddings (the forward pass) of the GPT2 model, as the model itself is non-trainable. This reduces training time considerably, since the forward pass through the GPT2 model then has to be done only once.
Example
Consider a multiclass classification problem with a class size of 5; it can be trained on the text and labels vectors using:
train_discriminator(text, labels, 16, "Multiclass", 5)
Bag of Words
PPLM.get_bow_indices
— Function get_bow_indices(bow_key_or_path_list::Vector{String}, tokenizer)
Returns a list of lists of word indices, one for each Bag of Words in bow_key_or_path_list, after tokenization. The function looks for the provided BoW key among the registered artifacts in the Artifacts.toml file. If it is not present there, the function expects the BoW key to be the complete path to the file or a URL from which to download the .txt file.
Example
get_bow_indices(["legal", "military"], tokenizer)
PPLM.build_bow_ohe
— Function build_bow_ohe(bow_indices, tokenizer)
Build and return a list of one_hot_matrix entries, one for each Bag of Words list of indices. Each item of the list has dimensions (num_of_BoW_words, tokenizer.vocab_size).
Note: While building the OHE of word indices, only those words whose length is 1 after tokenization are kept; the rest are discarded.
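A sketch of chaining the two BoW helpers:
bow_indices = get_bow_indices(["legal"], tokenizer)
ohe_list    = build_bow_ohe(bow_indices, tokenizer)   # one one-hot matrix per bag of words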
Generation
Normal
PPLM.sample_normal
— Function sample_normal(;prompt="I hate the customs", tokenizer=nothing, model=nothing, max_length=100, method="top_k", k=50, t=1.2, p=0.5, add_eos_start=true)
Generate ordinary (unperturbed) sentences with the model and tokenizer provided. If they are not provided, the function itself creates an instance of the GPT2-small tokenizer and LM head model. The sentence starts with the provided prompt and generation continues until the token length reaches max_length.
Two sampling methods are provided with this function:
- method="top_k"
- method="nucleus"
Either method can be used, together with the corresponding k or p parameter.
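A minimal call sketch (the prompt is illustrative; the model and tokenizer are created internally when omitted):
sentence = sample_normal(prompt="The food at this place is", method="top_k", k=40, max_length=60)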
PPLM
PPLM.sample_pplm
— Function sample_pplm(pplm_args; tokenizer=nothing, model=nothing, prompt="I hate the customs", sample_method="top_k", add_eos_start=true)
PPLM-based generation. Generates a perturbed sentence using pplm_args, the tokenizer and the model (GPT2 when not provided), starting with prompt. The generation is controlled by the arguments/parameters in pplm_args, which is an instance of the pplm struct.
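A call sketch, assuming args is an already constructed instance of the pplm struct configured for the desired attribute (its construction is not shown here):
generated = sample_pplm(args; prompt="The weather today is", sample_method="top_k")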
PPLM.perturb_probs
— Function perturb_probs(probs, tokenizer, args)
Perturb the probabilities probs based on the provided Bag of Words list (as given with args). This function is supported only for the BoW model.
PPLM.perturb_hidden_bow
— Function perturb_hidden_bow(hidden, model, tokenizer, args)
Perturb the hidden states hidden based on the provided Bag of Words list (as given with args). The perturbation is primarily based on the gradient of losses computed over the desired Bag of Words and the KL divergence from the original token distribution.
See also perturb_hidden_discrim.
PPLM.perturb_past_bow
— Function perturb_past_bow(model, prev, past, original_probs, args)
Perturb the past key values prev based on the provided Bag of Words list (as given with args). The perturbation is primarily based on the gradient of losses computed over the desired Bag of Words and the KL divergence from the original token distribution.
See also perturb_past_discrim.
PPLM.perturb_hidden_discrim
— Function perturb_hidden_discrim(hidden, model, tokenizer, args)
Perturb the hidden states hidden based on the provided Discriminator (as given with args). The perturbation is primarily based on the gradient of losses computed over the desired Discriminator attribute and the KL divergence from the original token distribution.
See also perturb_hidden_bow.
PPLM.perturb_past_discrim
— Function perturb_past_discrim(model, prev, past, original_probs, args)
Perturb the past key values prev based on the provided Discriminator (as given with args). The perturbation is primarily based on the gradient of losses computed over the desired Discriminator attribute and the KL divergence from the original token distribution.
See also perturb_past_bow.
Utils
PPLM.get_gpt2
— Function get_gpt2()
Load the GPT2 LM head model along with its tokenizer.
PPLM.get_gpt2_medium
— Function get_gpt2_medium()
Load the GPT2-medium LM head model along with its tokenizer.
Note: If this function gives a permission denied error, try changing the file permissions of the Artifacts.toml file of the Transformers.jl package (it is read-only by default), found under the src/huggingface folder.
PPLM.set_device
— Function set_device(d_id=0)
Set the CUDA device, if available, and disallow scalar operations.
PPLM.get_registered_file
— Function get_registered_file(name)
Fetch the registered file path from Artifacts.toml, based on the artifact name.
PPLM.get_artifact
— Function get_artifact(name)
Utility function to download/install the artifact if it is not already installed.
PPLM.register_custom_file
— Function register_custom_file(artifact_name, file_name, path)
Register a custom file under artifact_name in Artifacts.toml. path expects the path of the directory where the file file_name is stored. The complete path to the file is stored as the Artifact URL.
Example
register_custom_file("custom", "xyz.txt", "./folder/folder/")
Note: If this gives a permission denied error, change the Artifacts.toml file permissions using chmod(path_to_file_in_julia_installation, 0o764) or similar.
PPLM.top_k_sample
— Function top_k_sample(probs; k=10)
Sampling function that returns an index drawn from the top_k probabilities, based on the provided k. The function removes all tokens with a probability lower than that of the last token of the top_k before sampling.
PPLM.nucleus_sample
— Function nucleus_sample(probs; p=0.8)
Nucleus sampling function: samples from the reverse-sorted probabilities probs up to the index at which the cumulative probability is still less than the provided p. Tokens with cumulative probability above the threshold p are removed before sampling.
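A small sketch with a toy probability vector (real use would pass the model's next-token distribution):
probs  = [0.05, 0.2, 0.4, 0.25, 0.1]
i_topk = top_k_sample(probs; k=3)       # index sampled from the 3 most probable entries
i_nuc  = nucleus_sample(probs; p=0.8)   # index sampled from the nucleus with cumulative probability below 0.8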
PPLM.binary_accuracy
— Function binary_accuracy(y_pred, y_true; threshold=0.5)
Calculates the averaged binary accuracy from y_pred and y_true. The threshold argument specifies the minimum predicted probability y_pred required for a prediction to be labelled 1. The default value is 0.5.
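A quick numeric sketch of the thresholding described above:
binary_accuracy([0.3, 0.8, 0.6], [0, 1, 1])   # thresholded predictions 0, 1, 1 match all labels, giving 1.0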
PPLM.categorical_accuracy
— Function categorical_accuracy(y_pred, y_true)
Calculates the averaged categorical accuracy from y_pred and y_true.