API Functions
Here are some of the API functions provided with this package:
GPT2 Tokenizer
PPLM.load_pretrained_tokenizer — Function
load_pretrained_tokenizer(ty::Type{T}; unk_token="<|endoftext|>", eos_token="<|endoftext|>", pad_token="<|endoftext|>") where T<:PretrainedTokenizer
Load the GPT2 tokenizer, using DataDeps for the pretrained BPE and vocab files. Returns the tokenizer as a GPT2Tokenizer structure.
load_pretrained_tokenizer(path_bpe, path_vocab, unk_token, eos_token, pad_token)
Load a pretrained GPT2 tokenizer from the provided BPE and vocab file paths. Initialises unk_token, eos_token and pad_token as provided to the function. Returns the tokenizer as a GPT2Tokenizer structure.
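For example, a minimal sketch assuming GPT2Tokenizer is the concrete PretrainedTokenizer subtype to load:
tokenizer = load_pretrained_tokenizer(GPT2Tokenizer)   # fetches the pretrained BPE and vocab via DataDeps on first use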
PPLM.tokenize — Function
tokenize(t::GPT2Tokenizer, text::AbstractString)
Tokenize the given text with the tokenizer's BPE encoder (t.bpe_encode). Returns a vector of string tokens.
PPLM.encode — Function
encode(t::GPT2Tokenizer, text::AbstractString; add_prefix_space=false)
Returns the encoded vector of tokens (mapped through the tokenizer vocab) for text. If add_prefix_space=true, a space is added at the start of text before tokenization.
Example
For single text:
encode(tokenizer, text)
For a vector of texts:
map(x->encode(tokenizer, x), text_vector)
encode(t::GPT2Tokenizer, tokens::Vector{String})
Encode a vector of tokens to their integer mapping from the tokenizer vocab.
PPLM.decode — Function
decode(vocab::Vocabulary{T}, is::Vector{Int}) where T
Return the decoded vector of string tokens from the indices vector is, using the vocab.
decode(t::GPT2Tokenizer, tokens_ids::Vector{Int})
Return the decoded vector of string tokens from the indices vector tokens_ids, using the tokenizer's encoder.
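For example, assuming tokenizer is a loaded GPT2Tokenizer, encoding and decoding round-trip as:
ids    = encode(tokenizer, "I hate the customs")
tokens = decode(tokenizer, ids)   # back to string tokens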
PPLM.detokenize — Function
detokenize(t::GPT2Tokenizer, tokens::Vector{String})
BPE-decode the vector of string tokens, using the tokenizer t.
detokenize(t::GPT2Tokenizer, tokens_ids::Vector{Int})
Decode and detokenize the vector of indices tokens_ids. Returns the final sentence after detokenization.
Example
For single vector of token_ids:
detokenize(tokenizer, token_ids)
For a vector of vectors of token_ids, use:
map(x->detokenize(tokenizer, x), tokens_id_vector_of_vector)
Discriminator Model
General
PPLM.ClassifierHead — Type
struct ClassifierHead
    linearlayer::Dense
    embedsize::Int
    class_size::Int
end
Struct for ClassifierHead, defined with a single linear layer and two parameters: embedsize -> size of the embedding, class_size -> number of classes.
PPLM.get_discriminator — Function
get_discriminator(model; load_from_pretrained=false, discrim=nothing, file_name=nothing, version=2, class_size::Int=1, embed_size::Int=768, path=nothing)
Create a discriminator based on the provided model. If load_from_pretrained is true, loads the ClassifierHead layer from the pretrained models or the provided path.
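A hedged sketch of creating a binary discriminator head; here model stands for an already-loaded GPT2 LM head model (e.g. obtained via get_gpt2):
discrim = get_discriminator(model; class_size=2, embed_size=768)   # model: a loaded GPT2 LM head model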
PPLM.save_classifier_head — Function
save_classifier_head(cl_head; file_name=nothing, path=nothing, args=nothing, register_discrim=true, discrim_name="")
Save the ClassifierHead as a BSON file once training is complete, at the provided path. If path is nothing, the discriminator is saved in the ./pretrained_discriminators folder relative to the package directory.
PPLM.save_discriminator — Function
save_discriminator(discrim, discrim_name="Custom"; file_name=nothing, path=nothing, args=nothing)
Save the ClassifierHead part of the discriminator (by calling save_classifier_head), which is the only trainable part of the discriminator.
Data Processing
PPLM.pad_seq — Function
pad_seq(batch::AbstractVector{T}, pad_token::Integer=0)
Add pad tokens to the shorter sequences so that every sequence has the same length as the longest one in the batch (the maximum of map(length, batch)). The pad token defaults to 0.
PPLM.get_mask — Function
get_mask(seq::AbstractMatrix{T}, pad_token::Integer=0, embed_size::Integer=768)
Create a mask for the sequences against padding, so the model knows which part of a sequence is padding and should be ignored.
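A minimal sketch of the two helpers together, assuming pad_seq returns the padded batch in the matrix form expected by get_mask:
padded = pad_seq([[15496, 995], [40, 716, 3772]])   # shorter sequences are padded with 0
mask   = get_mask(padded)                           # marks the padded positions to be ignored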
PPLM.data_preprocess — Function
data_preprocess(data_x, data_y, classification_type::String="Binary", num_classes::Integer=2; args=nothing)
Preprocess data_x and data_y, and create a mask for data_x.
Preprocessing for data_x consists of padding with the pad token (expected to be provided as args.pad_token).
Preprocessing for data_y consists of creating a one-hot batch over 1:num_classes (if classification_type is not "Binary"); otherwise data_y is reshaped to (1, length(data_y)).
Returns data_x, data_y, mask after preprocessing.
PPLM.load_data — Function
load_data(data_x, data_y, tokenizer::PretrainedTokenizer; batchsize::Integer=8, truncate::Bool=false, max_length::Integer=256, shuffle::Bool=false, drop_last::Bool=false, add_eos_start::Bool=true)
Returns a DataLoader over data_x and data_y, with the given batchsize, after processing data_x. The processing consists of tokenizing data_x and truncating it to max_length if truncate is true.
If add_eos_start is true, the tokenizer's EOS token is added at the start.
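For example, assuming text and labels are text/label vectors (e.g. as returned by load_data_from_csv) and tokenizer is a loaded GPT2Tokenizer:
loader = load_data(text, labels, tokenizer; batchsize=8, truncate=true, max_length=256)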
PPLM.load_cached_data — Function
load_cached_data(discrim::Union{DiscriminatorV1, DiscriminatorV2}, data_x, data_y, tokenizer::PretrainedTokenizer; truncate::Bool=false, max_length::Integer=256, shuffle::Bool=false, batchsize::Int=4, drop_last::Bool=false, classification_type="Binary", num_classes=2, args=nothing)
Returns a DataLoader with (x, y) pairs that can be fed directly into the classifier layer for training.
The function first loads the data using load_data with batchsize=1, then passes each batch to the transformer model of discrim after data preprocessing; the average representations of the hidden states are stored in a vector, which is then loaded into a DataLoader ready for classification training.
Note: This function saves time by caching the average representation of the hidden states beforehand, avoiding a pass of the data through the model in each training epoch. This works because the model itself is non-trainable while training the discriminator's classifier head.
PPLM.load_data_from_csv — Function
load_data_from_csv(path_to_csv; text_col="text", label_col="label", delim=',', header=1)
Load data from a CSV file, using the specified text_col column for text and label_col for the target label. Returns vectors for text and label.
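For instance (the path below is a placeholder):
text, labels = load_data_from_csv("./data/train.csv"; text_col="text", label_col="label")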
Training
PPLM.train! — Function
train!(discrim, data_loader; args=args)
Train the discriminator on the training data in data_loader, using the provided arguments args.
PPLM.test! — Function
test!(discrim, data_loader; args=nothing)
Evaluate the discriminator on the test data in data_loader, reporting accuracy and NLL loss.
PPLM.train_discriminator — Function
train_discriminator(text, labels, batchsize::Int=8, classification_type::String="Binary", num_classes::Int=2; model="gpt2", cached::Bool=true, discrim=nothing, tokenizer=nothing, truncate=true, max_length=256, train_size::Float64=0.9, lr::Float64=1e-5, epochs::Int=10, args=nothing)
Train a discriminator on the provided text and target labels, based on the given function parameters. Returns the discriminator discrim after training.
Here cached=true enables caching of the contextualized embeddings (forward pass) of the GPT2 model, as the model itself is non-trainable. This effectively reduces training time, since the forward pass through the GPT2 model only has to be done once.
Example
Consider a multiclass classification problem with a class size of 5; it can be trained on text and labels vectors using:
train_discriminator(text, labels, 16, "Multiclass", 5)
Bag of Words
PPLM.get_bow_indices — Function
get_bow_indices(bow_key_or_path_list::Vector{String}, tokenizer)
Returns a list of lists of word indices, one per Bag of Words in bow_key_or_path_list, after tokenization. The function looks for the provided BoW key among the registered artifacts in the Artifacts.toml file. If it is not present there, the function expects bow_key to be either the complete path to the file or a URL from which to download the .txt file.
Example
get_bow_indices(["legal", "military"])PPLM.build_bow_ohe — Functionbuild_bow_ohe(bow_indices, tokenizer)Build and return a list of one_hot_matrix for each Bag Of Words list from indices. Each item of the list is of dimension (num_of_BoW_words, tokenizer.vocab_size).
Note: While building the OHE of word indices, it only keeps those words, which have length 1 after tokenization and discard the rest.
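A small sketch chaining the two helpers (assuming tokenizer is a loaded GPT2Tokenizer):
indices  = get_bow_indices(["legal", "military"], tokenizer)
one_hots = build_bow_ohe(indices, tokenizer)   # one one-hot matrix per BoW list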
Generation
Normal
PPLM.sample_normal — Function
sample_normal(;prompt="I hate the customs", tokenizer=nothing, model=nothing, max_length=100, method="top_k", k=50, t=1.2, p=0.5, add_eos_start=true)
Generate normal (unperturbed) sentences with the provided model and tokenizer. If they are not provided, the function itself creates an instance of the GPT2-small tokenizer and LM head model. Sentences start with the provided prompt and are generated until the token length reaches max_length.
Two sampling methods of generation are provided with this function:
- method='top_k'
- method='nucleus'
Either method can be used, by providing k or p respectively.
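For example, using the default GPT2-small model with top-k sampling:
sample_normal(prompt="The weather is", max_length=60, method="top_k", k=50)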
PPLM
PPLM.sample_pplm — Function
sample_pplm(pplm_args; tokenizer=nothing, model=nothing, prompt="I hate the customs", sample_method="top_k", add_eos_start=true)
PPLM-based generation. Generates a perturbed sentence using pplm_args, tokenizer and model (GPT2 if not provided), starting with prompt. The generation is driven by the arguments/parameters in pplm_args, which is an instance of the pplm struct.
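A hedged sketch, assuming pplm_args has already been constructed as an instance of the pplm struct (configured for a Bag of Words or discriminator attribute):
sample_pplm(pplm_args; prompt="The food is", sample_method="top_k")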
PPLM.perturb_probs — Function
perturb_probs(probs, tokenizer, args)
Perturb the probabilities probs based on the provided Bag of Words list (as given in args). This function is supported only for the BoW model.
PPLM.perturb_hidden_bow — Function
perturb_hidden_bow(hidden, model, tokenizer, args)
Perturb the hidden states hidden based on the provided Bag of Words list (as given in args). The perturbation is primarily based on the gradient of losses evaluated over the desired Bag of Words and the KL divergence from the original token probabilities.
Also check out perturb_hidden_discrim.
PPLM.perturb_past_bow — Function
perturb_past_bow(model, prev, past, original_probs, args)
Perturb the past key values prev based on the provided Bag of Words list (as given in args). The perturbation is primarily based on the gradient of losses evaluated over the desired Bag of Words and the KL divergence from the original token probabilities.
Also check out perturb_past_discrim.
PPLM.perturb_hidden_discrim — Function
perturb_hidden_discrim(hidden, model, tokenizer, args)
Perturb the hidden states hidden based on the provided discriminator (as given in args). The perturbation is primarily based on the gradient of losses evaluated over the desired discriminator attribute and the KL divergence from the original token probabilities.
Also check out perturb_hidden_bow.
PPLM.perturb_past_discrim — Function
perturb_past_discrim(model, prev, past, original_probs, args)
Perturb the past key values prev based on the provided discriminator (as given in args). The perturbation is primarily based on the gradient of losses evaluated over the desired discriminator attribute and the KL divergence from the original token probabilities.
Also check out perturb_past_bow.
Utils
PPLM.get_gpt2 — Function
get_gpt2()
Load the GPT2 LM head model along with its tokenizer.
PPLM.get_gpt2_medium — Function
get_gpt2_medium()
Load the GPT2-medium LM head model along with its tokenizer.
Note: If this function gives a permission-denied error, try changing the file permissions of the Artifacts.toml file of the Transformers.jl package (it is read-only by default), located under its src/huggingface folder.
PPLM.set_device — Function
set_device(d_id=0)
Set the CUDA device if available, and disallow scalar operations.
PPLM.get_registered_file — Function
get_registered_file(name)
Fetch a registered file path from Artifacts.toml, based on the artifact name.
PPLM.get_artifact — Function
get_artifact(name)
Utility function to download/install the artifact in case it is not already installed.
PPLM.register_custom_file — Function
register_custom_file(artifact_name, file_name, path)
Register a custom file under artifact_name in Artifacts.toml. path expects the path of the directory where the file file_name is stored. The complete path to the file is stored as the artifact URL.
Example
register_custom_file("custom", "xyz.txt", "./folder/folder/")
Note: If this gives a permission-denied error, change the Artifacts.toml file permissions using chmod(path_to_file_in_julia_installation, 0o764) or similar.
PPLM.top_k_sample — Function
top_k_sample(probs; k=10)
Sampling function that returns an index from the top-k probabilities, based on the provided k. All tokens with a probability less than that of the last token of the top-k are removed before sampling.
PPLM.nucleus_sample — Function
nucleus_sample(probs; p=0.8)
Nucleus sampling function: the probabilities probs are sorted in decreasing order and sampling is restricted to the prefix whose cumulative probability remains below the provided p. Tokens beyond that cumulative-probability threshold are removed before sampling.
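For example, given a probability vector probs over the vocabulary (e.g. the softmax over the model's next-token logits):
idx_topk    = top_k_sample(probs; k=50)      # sample among the 50 most likely tokens
idx_nucleus = nucleus_sample(probs; p=0.8)   # sample from the highest-probability tokens whose cumulative probability stays below 0.8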
PPLM.binary_accuracy — Function
binary_accuracy(y_pred, y_true; threshold=0.5)
Calculates the averaged binary accuracy from y_pred and y_true. The threshold argument specifies the minimum predicted probability y_pred required to be labelled as 1. Defaults to 0.5.
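For instance, with predicted probabilities and true labels:
binary_accuracy([0.9, 0.2, 0.7], [1, 0, 0])   # with threshold=0.5 the predictions are [1, 0, 1], so 2 of the 3 are correct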
PPLM.categorical_accuracy — Function
categorical_accuracy(y_pred, y_true)
Calculates the averaged categorical accuracy from y_pred and y_true.