API Functions
Here are some of the API functions provided with this package:
GPT2 Tokenizer
PPLM.load_pretrained_tokenizer
— Function load_pretrained_tokenizer(ty::Type{T}; unk_token="<|endoftext|>", eos_token="<|endoftext|>", pad_token="<|endoftext|>") where T<:PretrainedTokenizer
Load the GPT2 tokenizer using DataDeps for the pretrained BPE and vocab files. Returns the tokenizer as a GPT2Tokenizer structure.
load_pretrained_tokenizer(path_bpe, path_vocab, unk_token, eos_token, pad_token)
Load a pretrained GPT2 tokenizer from the provided BPE and vocab file paths. Initialises unk_token, eos_token and pad_token as provided to the function. Returns the tokenizer as a GPT2Tokenizer structure.
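A minimal usage sketch, assuming the function and the GPT2Tokenizer type are reachable from the PPLM namespace:
using PPLM
tokenizer = load_pretrained_tokenizer(PPLM.GPT2Tokenizer)   # fetches BPE/vocab via DataDeps on first use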
PPLM.tokenize
— Function tokenize(t::GPT2Tokenizer, text::AbstractString)
Tokenize the given text with the tokenizer's BPE encoder (t.bpe_encode). Returns a string vector of tokens.
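A short call sketch (the exact token strings depend on the BPE merges):
tokens = tokenize(tokenizer, "Hello world")   # Vector{String} of BPE tokens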
PPLM.encode
— Function encode(t::GPT2Tokenizer, text::AbstractString; add_prefix_space=false)
Returns the encoded vector of tokens (mapped through the tokenizer's vocab) for text. If add_prefix_space=true, a space is added at the start of text before tokenization.
Example
For single text:
encode(tokenizer, text)
For vector of text:
map(x->encode(tokenizer, x), text_vector)
encode(t::GPT2Tokenizer, tokens::Vector{String})
Encode a vector of tokens to their integer mapping from the tokenizer's vocab.
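A sketch of chaining tokenization and encoding (the returned indices depend on the vocab):
tokens = tokenize(tokenizer, "Hello world")
ids    = encode(tokenizer, tokens)            # Vector{Int} of vocab indices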
PPLM.decode
— Function decode(vocab::Vocabulary{T}, is::Vector{Int}) where T
Return the decoded vector of string tokens for the indices vector is, using the vocab.
decode(t::GPT2Tokenizer, tokens_ids::Vector{Int})
Return the decoded vector of string tokens for the indices vector tokens_ids, using the tokenizer t's encoder.
PPLM.detokenize
— Function detokenize(t::GPT2Tokenizer, tokens::Vector{String})
BPE-decode the vector of strings, using the tokenizer t.
detokenize(t::GPT2Tokenizer, tokens_ids::Vector{Int})
Decode and detokenize the vector of indices tokens_ids. Returns the final sentence after detokenization.
Example
For a single vector of token ids:
detokenize(tokenizer, token_ids)
For a vector of vectors of token ids, use:
map(x->detokenize(tokenizer, x), tokens_id_vector_of_vector)
Discriminator Model
General
PPLM.ClassifierHead
— Type
struct ClassifierHead
    linearlayer::Dense
    embedsize::Int
    class_size::Int
end
Struct for ClassifierHead, defined with a single linear layer and two parameters: embedsize, the size of the embedding, and class_size, the number of classes.
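A minimal construction sketch, assuming the default field-order constructor and Flux's Dense layer:
using Flux
head = ClassifierHead(Dense(768, 2), 768, 2)   # 768-dim embeddings, 2 classes (constructor order assumed)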
PPLM.get_discriminator
— Function get_discriminator(model; load_from_pretrained=false, discrim=nothing, file_name=nothing, version=2, class_size::Int=1, embed_size::Int=768, path=nothing)
Create a discriminator based on the provided model. If load_from_pretrained is true, the ClassifierHead layer is loaded from the pretrained models or from the path provided.
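A usage sketch, assuming get_gpt2() from the Utils section returns the model together with its tokenizer (return order assumed):
model, tokenizer = get_gpt2()
discrim = get_discriminator(model; class_size=2, embed_size=768)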
PPLM.save_classifier_head
— Function save_classifier_head(cl_head; file_name=nothing, path=nothing, args=nothing, register_discrim=true, discrim_name="")
Save the ClassifierHead as a BSON file once training is complete, at the path provided. If path is nothing, the classifier head is saved in the ./pretrained_discriminators folder relative to the package directory.
PPLM.save_discriminator
— Function save_discriminator(discrim, discrim_name="Custom"; file_name=nothing, path=nothing, args=nothing)
Save the ClassifierHead part of the discriminator (by calling the save_classifier_head function), which is the only trainable part of the discriminator.
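A minimal call sketch (the discriminator name here is hypothetical):
save_discriminator(discrim, "my_sentiment")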
Data Processing
PPLM.pad_seq
— Function pad_seq(batch::AbstractVector{T}, pad_token::Integer=0)
Add pad tokens to the shorter sequences, so that every sequence in the batch has length max_length (calculated as max(map(length, batch))). The pad token defaults to 0.
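A small sketch of the intended behaviour (the return container is assumed to mirror the input batch):
batch  = [[12, 43, 7], [5, 9]]
padded = pad_seq(batch)        # the shorter sequence is padded with 0 up to length 3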
PPLM.get_mask
— Function get_mask(seq::AbstractMatrix{T}, pad_token::Integer=0, embed_size::Integer=768)
Create a mask for the sequences against padding, so that the model knows which part of a sequence is padding and should be ignored.
PPLM.data_preprocess
— Function data_preprocess(data_x, data_y, classification_type::String="Binary", num_classes::Integer=2; args=nothing)
Preprocess data_x and data_y and create a mask for data_x.
Preprocessing for data_x consists of padding with the pad token (expected to be provided as args.pad_token).
Preprocessing for data_y consists of creating a onehotbatch for data_y over 1:num_classes (if classification_type is not "Binary"); otherwise the data is reshaped as (1, length(data_y)).
Returns data_x, data_y and mask after preprocessing.
PPLM.load_data
— Function load_data(data_x, data_y, tokenizer::PretrainedTokenizer; batchsize::Integer=8, truncate::Bool=false, max_length::Integer=256, shuffle::Bool=false, drop_last::Bool=false, add_eos_start::Bool=true)
Returns a DataLoader for data_x and data_y after processing data_x, with batchsize=batchsize. The processing consists of tokenizing data_x and, if truncate is true, truncating it to max_length.
If add_eos_start is true, the tokenizer's EOS token is added at the start.
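A usage sketch with hypothetical text and label vectors:
text   = ["the film was great", "the plot made no sense"]
labels = [1, 0]
loader = load_data(text, labels, tokenizer; batchsize=2, truncate=true, max_length=64)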
PPLM.load_cached_data
— Function load_cached_data(discrim::Union{DiscriminatorV1, DiscriminatorV2}, data_x, data_y, tokenizer::PretrainedTokenizer; truncate::Bool=false, max_length::Integer=256, shuffle::Bool=false, batchsize::Int=4, drop_last::Bool=false, classification_type="Binary", num_classes=2, args=nothing)
Returns a DataLoader with (x, y) pairs that can be fed directly into the classifier layer for training.
The function first loads the data using the load_data function with batchsize=1, then passes each batch through the transformer model of discrim after data preprocessing; the average representation of the hidden_states is stored in a vector, which is then loaded into a DataLoader ready for classification training.
Note: This function saves time by caching the average representation of the hidden states beforehand, avoiding a pass through the model in every training epoch. This is possible because the model itself is non-trainable while training the discriminator classifier head.
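A sketch mirroring the load_data example above, with the same hypothetical data:
cached_loader = load_cached_data(discrim, text, labels, tokenizer; batchsize=4, truncate=true, max_length=64)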
PPLM.load_data_from_csv
— Function load_data_from_csv(path_to_csv; text_col="text", label_col="label", delim=',', header=1)
Load data from a CSV file, using the text_col column for the text and the label_col column for the target label. Returns the text and label vectors.
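A call sketch with a hypothetical file and column names:
text, labels = load_data_from_csv("data/reviews.csv"; text_col="review", label_col="sentiment")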
Training
PPLM.train!
— Function train!(discrim, data_loader; args=args)
Train the discriminator on the training data provided by data_loader, using the arguments args.
PPLM.test!
— Function test!(discrim, data_loader; args=nothing)
Test the discriminator on the test data provided by data_loader, reporting accuracy and NLL loss.
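A short train/evaluate sketch, assuming train_loader and test_loader come from the data-processing functions above and args is a pre-built arguments object (all names hypothetical):
train!(discrim, train_loader; args=args)
test!(discrim, test_loader; args=args)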
PPLM.train_discriminator
— Function train_discriminator(text, labels, batchsize::Int=8, classification_type::String="Binary", num_classes::Int=2; model="gpt2", cached::Bool=true, discrim=nothing, tokenizer=nothing, truncate=true, max_length=256, train_size::Float64=0.9, lr::Float64=1e-5, epochs::Int=10, args=nothing)
Train a discriminator on the provided text and target labels, based on the function parameters given. Returns the discriminator discrim after training.
Setting cached=true enables caching of the contextualized embeddings (the forward pass) of the GPT2 model, as the model itself is non-trainable. This reduces training time considerably, since the forward pass through the GPT2 model then has to be done only once.
Example
Consider a multiclass classification problem with a class size of 5; it can be trained on the text and labels vectors using:
train_discriminator(text, labels, 16, "Multiclass", 5)
Bag of Words
PPLM.get_bow_indices
— Function get_bow_indices(bow_key_or_path_list::Vector{String}, tokenizer)
Returns a list of lists of word indices, one for each Bag of Words in bow_key_or_path_list, after tokenization. The function looks for the provided BoW key among the registered artifacts in the Artifacts.toml file. If it is not present there, the function expects the BoW key to be the complete path to the file or a URL from which to download the .txt file.
Example
get_bow_indices(["legal", "military"], tokenizer)
PPLM.build_bow_ohe
— Function build_bow_ohe(bow_indices, tokenizer)
Build and return a list of one_hot_matrix entries, one for each Bag of Words list of indices. Each item of the list has dimensions (num_of_BoW_words, tokenizer.vocab_size).
Note: While building the OHE of word indices, only those words whose length is 1 after tokenization are kept; the rest are discarded.
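A sketch of chaining the two BoW helpers:
bow_indices = get_bow_indices(["legal"], tokenizer)
ohe_list    = build_bow_ohe(bow_indices, tokenizer)   # one one-hot matrix per bag of words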
Generation
Normal
PPLM.sample_normal
— Function sample_normal(;prompt="I hate the customs", tokenizer=nothing, model=nothing, max_length=100, method="top_k", k=50, t=1.2, p=0.5, add_eos_start=true)
Generate ordinary (unperturbed) sentences with the model and tokenizer provided. If they are not provided, the function itself creates an instance of the GPT2-small tokenizer and LM head model. The sentence starts with the provided prompt and generation continues until the token length reaches max_length.
Two sampling methods are provided with this function:
- method="top_k"
- method="nucleus"
Either method can be used, together with the corresponding k or p parameter.
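A minimal call sketch (the prompt is illustrative; the model and tokenizer are created internally when omitted):
sentence = sample_normal(prompt="The food at this place is", method="top_k", k=40, max_length=60)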
PPLM
PPLM.sample_pplm
— Function sample_pplm(pplm_args; tokenizer=nothing, model=nothing, prompt="I hate the customs", sample_method="top_k", add_eos_start=true)
PPLM-based generation. Generates a perturbed sentence using pplm_args, the tokenizer and the model (GPT2 when not provided), starting with prompt. The generation is controlled by the arguments/parameters in pplm_args, which is an instance of the pplm struct.
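A call sketch, assuming args is an already constructed instance of the pplm struct configured for the desired attribute (its construction is not shown here):
generated = sample_pplm(args; prompt="The weather today is", sample_method="top_k")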
PPLM.perturb_probs
— Function perturb_probs(probs, tokenizer, args)
Perturb the probabilities probs based on the provided Bag of Words list (as given with args). This function is supported only for the BoW model.
PPLM.perturb_hidden_bow
— Function perturb_hidden_bow(hidden, model, tokenizer, args)
Perturb the hidden states hidden based on the provided Bag of Words list (as given with args). The perturbation is primarily based on the gradient of losses computed over the desired Bag of Words and the KL divergence from the original token distribution.
See also perturb_hidden_discrim.
PPLM.perturb_past_bow
— Function perturb_past_bow(model, prev, past, original_probs, args)
Perturb the past key values prev based on the provided Bag of Words list (as given with args). The perturbation is primarily based on the gradient of losses computed over the desired Bag of Words and the KL divergence from the original token distribution.
See also perturb_past_discrim.
PPLM.perturb_hidden_discrim
— Function perturb_hidden_discrim(hidden, model, tokenizer, args)
Perturb the hidden states hidden based on the provided Discriminator (as given with args). The perturbation is primarily based on the gradient of losses computed over the desired Discriminator attribute and the KL divergence from the original token distribution.
See also perturb_hidden_bow.
PPLM.perturb_past_discrim
— Function perturb_past_discrim(model, prev, past, original_probs, args)
Perturb the past key values prev based on the provided Discriminator (as given with args). The perturbation is primarily based on the gradient of losses computed over the desired Discriminator attribute and the KL divergence from the original token distribution.
See also perturb_past_bow.
Utils
PPLM.get_gpt2
— Function get_gpt2()
Load the GPT2 LM head model along with its tokenizer.
PPLM.get_gpt2_medium
— Function get_gpt2_medium()
Load the GPT2-medium LM head model along with its tokenizer.
Note: If this function gives a permission denied error, try changing the file permissions of the Artifacts.toml file of the Transformers.jl package (it is read-only by default), found under the src/huggingface folder.
PPLM.set_device
— Function set_device(d_id=0)
Set the CUDA device, if available, and disallow scalar operations.
PPLM.get_registered_file
— Function get_registered_file(name)
Fetch the registered file path from Artifacts.toml, based on the artifact name.
PPLM.get_artifact
— Function get_artifact(name)
Utility function to download/install the artifact if it is not already installed.
PPLM.register_custom_file
— Function register_custom_file(artifact_name, file_name, path)
Register a custom file under artifact_name in Artifacts.toml. path expects the path of the directory where the file file_name is stored. The complete path to the file is stored as the Artifact URL.
Example
register_custom_file("custom", "xyz.txt", "./folder/folder/")
Note: If this gives a permission denied error, change the Artifacts.toml file permissions using chmod(path_to_file_in_julia_installation, 0o764) or similar.
PPLM.top_k_sample
— Function top_k_sample(probs; k=10)
Sampling function that returns an index drawn from the top_k probabilities, based on the provided k. The function removes all tokens with a probability lower than that of the last token of the top_k before sampling.
PPLM.nucleus_sample
— Function nucleus_sample(probs; p=0.8)
Nucleus sampling function: samples from the reverse-sorted probabilities probs up to the index at which the cumulative probability is still less than the provided p. Tokens with cumulative probability above the threshold p are removed before sampling.
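A small sketch with a toy probability vector (real use would pass the model's next-token distribution):
probs  = [0.05, 0.2, 0.4, 0.25, 0.1]
i_topk = top_k_sample(probs; k=3)       # index sampled from the 3 most probable entries
i_nuc  = nucleus_sample(probs; p=0.8)   # index sampled from the nucleus with cumulative probability below 0.8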
PPLM.binary_accuracy
— Function binary_accuracy(y_pred, y_true; threshold=0.5)
Calculates the averaged binary accuracy from y_pred and y_true. The threshold argument specifies the minimum predicted probability y_pred required for a prediction to be labelled 1. The default value is 0.5.
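A quick numeric sketch of the thresholding described above:
binary_accuracy([0.3, 0.8, 0.6], [0, 1, 1])   # thresholded predictions 0, 1, 1 match all labels, giving 1.0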
PPLM.categorical_accuracy
— Function categorical_accuracy(y_pred, y_true)
Calculates the averaged categorical accuracy from y_pred and y_true.