Train German language model (NLP)
Project detail
Hi,
We are looking for a Python developer to achieve the following two goals:
– Compare user input (text) against the dataset and show the top three similarity matches, each with some context (the two preceding and the two following sentences).
– Predict the next sentence based on user input (like an autocomplete function).
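As a rough illustration of the first goal, here is a minimal sketch of a top-three lookup with context. It assumes the dataset is already a flat list of sentences in document order. The bag-of-words cosine similarity is only a stand-in for the embeddings of the trained language model, and the names tokenize, cosine, and top_matches are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; \w also matches German umlauts and ß in Python 3.
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_matches(query, sentences, k=3, window=2):
    # Score every sentence against the query, keep the k best, and attach
    # up to `window` sentences of context on each side of each match.
    # Assumption: a real solution would score with model embeddings instead.
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(s))), i) for i, s in enumerate(sentences)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    results = []
    for score, i in scored[:k]:
        results.append({
            "score": score,
            "match": sentences[i],
            "before": sentences[max(0, i - window):i],
            "after": sentences[i + 1:i + 1 + window],
        })
    return results


if __name__ == "__main__":
    # Made-up example sentences, not taken from the dataset.
    corpus = [
        "Der Kläger verlangt Schadensersatz.",
        "Das Landgericht hat die Klage abgewiesen.",
        "Die Berufung des Klägers blieb ohne Erfolg.",
        "Mit der Revision verfolgt der Kläger sein Begehren weiter.",
        "Die Revision hat Erfolg.",
        "Das Berufungsurteil wird aufgehoben.",
    ]
    for hit in top_matches("Die Revision hat Erfolg", corpus):
        print(round(hit["score"], 3), hit["match"])
```

Context is cut off at the start and end of the list, so a match near a boundary returns fewer than two neighbouring sentences. If the list spans several documents, the window would also need to stop at document boundaries.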
To achieve these goals, we expect you to train a German language model on the provided text data.
• First milestone: Extraction of data. You will need to extract the data from these PDFs (https://zenodo.org/record/6982046/files/CE-BGH_2022-08-16_DE_PDF_Datensatz.zip?download=1) and convert it into an appropriate format (for orientation, see: https://github.com/lavis-nlp/german_legal_sentences; https://github.com/lavis-nlp/GerDaLIR). We will review the dataset and discuss it with you, and you will optimize it if necessary.
• Second milestone: Selection of an appropriate pre-trained language model (e.g. BERT, GPT) and further training (fine-tuning) of that model on the extracted data.
• Third milestone: Use the language model to find and output the top three similarity matches. Furthermore, use the language model to predict the next sentence based on user input.
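To make the first milestone more concrete, here is a minimal sketch of one possible sentence-level format, assuming the plain text has already been extracted from a PDF. The field names doc_id, sent_idx, and text are hypothetical and are not the schema of the repositories linked above. The regex splitter is deliberately naive: it will break incorrectly after abbreviations that are common in German legal text (e.g. "vgl. BGH"), so a proper German sentence segmenter would be needed in practice.

```python
import json
import re

# Naive boundary: sentence-ending punctuation, then whitespace, then an
# uppercase letter. Assumption: good enough for illustration only.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-ZÄÖÜ])")


def split_sentences(text):
    # Split raw text into trimmed, non-empty sentences.
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def to_records(doc_id, text):
    # One record per sentence; sent_idx preserves the order within the
    # document so that neighbouring sentences can be looked up later.
    return [
        {"doc_id": doc_id, "sent_idx": i, "text": s}
        for i, s in enumerate(split_sentences(text))
    ]


def to_jsonl(records):
    # One JSON object per line; ensure_ascii=False keeps umlauts readable.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    # Made-up example text, not taken from the dataset.
    sample = "Die Revision hat Erfolg. Das Urteil wird aufgehoben. Die Sache wird zurückverwiesen."
    print(to_jsonl(to_records("doc-1", sample)))
```

Keeping the sentence index per document makes it straightforward to retrieve the two preceding and two following sentences for a match, and to pair each sentence with its successor for the next-sentence task.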
We can discuss any details over chat.
Best,
Thomas