I am a researcher at the Austrian Research Institute for Artificial Intelligence and an enthusiastic developer of open source software.

Most recently I have been working on two projects. The first analyses the behaviour of visitors to a large Austrian news site and how the discourse in its online discussion forums influences interaction with the site. The second uses NLP (among other approaches) to find, understand and warn about abuse directed at women journalists on Twitter.

I was part of the core development team of the GATE NLP framework and the main designer and developer of the Python GateNLP package. I have also developed or contributed to a number of GATE plugins and other GATE-related software, as well as many other open-source projects.

I am interested in a wide range of research topics, including machine learning approaches for text tagging, learning to rank and metric learning; approaches to natural language processing based on imitation learning and cost-sensitive learning; and the synergies between knowledge representation and ontologies on the one hand and natural language processing on the other.

Contact

Freyung 6/6, A-1010 Vienna, Austria
Email: johann.petrak (AT) gmail.com (preferred)
Email: johann.petrak (AT) ofai.at
Skype: joh_pet
GitHub / GitLab / BitBucket
Google Scholar
dblp
Semantic Scholar
ORCID
Keybase
LinkedIn
Facebook
Threads @johannpetrak
PostNews @johann_petrak
Mastodon
Twitter @johann_p
Bluesky @johann-petrak.bsky.social
Keyoxide

Publications

Selected Projects

A list of selected projects I participated in:

Knowmak (2017/18): methods for ontology enrichment and keyword extraction for ontology-based topic annotation of documents
Comrades (2016): Collective Platform for Community Resilience and Social Innovation during Crises. Development and deployment of a multilingual entity disambiguation and linking system as a service, based on GATE.
DeCarboNet (2016): A Decarbonisation Platform for Citizen Empowerment and Translating Collective Awareness into Behavioural Change
Trendminer (2013/14): large-scale semantic annotation of news articles and tweets based on named entities from Wikipedia/DBpedia and development of approaches for the disambiguation between candidate entities.
Khresmoi (2013/14): large-scale semantic annotation of health-related documents and web pages for semantic search; annotation of named entities and relations for anatomy and pathology terms in German radiology reports.
OREX (2010-12): Ontology-based information extraction and search: Ontology-based recognition and extraction of relevant information from online job-ads and semantic search of the extracted information.
LarKC (EU FP 7 Large-Scale Integrating Project) (2009): random indexing for the detection of semantically related nodes in large ontologies and linked biomedical data.
INSPIRATION (K-Net COAST) (2009/10): structural analysis of medical dictations, automatic recognition of sentence boundaries in audio-transcripts.

Software

I am an enthusiastic developer and supporter of open source software and have contributed to the development of the GATE NLP framework, the Python GateNLP framework, many GATE NLP plugins, the FARM deep learning framework, and a number of other projects.

Some selected software projects:

Python gatenlp
Python 3 library that brings the main GATE abstractions and features to Python and provides interfaces to popular NLP libraries like spaCy and Stanford Stanza.
Python configsimple
A Python package that makes it easy to configure each component of a larger system, both in a way similar to argparse and from config files, and to query all configuration parameters from the command line.
gateplugin-LearningFramework
A GATE plugin for using various machine learning algorithms from within GATE. It supports classification, regression and tagging tasks and allows the use of algorithms from LibSVM, Mallet, Weka (as external program), Scikit-Learn (external, Python), CostCLA (external, Python), PyTorch (external, Python) and Keras (external, Python).
python-sparsevectors
A fast implementation of sparse float vectors based on defaultdict(float), useful for speeding up various machine learning algorithms. This is based on Liang Huang’s hvector library but works with Python 3.
gateplugin-ModularPipelines
A GATE plugin that brings two important properties to GATE pipelines: modularity and parametrizability. The plugin provides a new processing resource which makes it easy to include pipelines within pipelines while keeping each of the contained pipeline files separate. It also provides a new kind of controller which makes it possible to override or set any runtime parameter or init parameter for any of the processing resources in the pipeline, to set document features, or to enable or disable a PR within the pipeline.
gateplugin-Java
A GATE plugin which makes it easy to write Java code that gets executed in a pipeline. The Java code gets compiled on the fly and there is no need to restart GATE or reload the pipeline when the Java program is modified.
gateplugin-StringAnnotation
A GATE plugin which provides processing resources for very flexible matching of text using nestable Java regular expressions, and for very fast and compact use of gazetteer lists for matching either document text or text extracted from annotation features (similar to what the FlexibleGazetteer does).
gateplugin-Tagger_TagMe
A GATE plugin which can connect to the TagMe web API to annotate documents.
gateplugin-CorpusStats
A GATE plugin to create term frequency, document frequency and tf*idf stats for a corpus. This plugin can be run multi-threaded using GCP.
gatelib-interaction
A library that simplifies the interaction between GATE processing resources and external software. The interaction can take place either by starting a separate process and communicating with it through pipes, or by communicating with a separate server. So far this is mainly used to enable the GATE machine learning plugin, gateplugin-LearningFramework, to use Weka, Scikit-Learn, Keras, PyTorch and other external tools.
gateplugin-AnnotationGraphs
A GATE plugin that makes it easier to handle graphs of annotations, i.e. annotations representing trees, coreference chains, candidate lists or anything else where one annotation needs to refer to one or more other annotations in some way.
gateplugin-Evaluation
A GATE plugin which provides the ability to carry out evaluations from within a pipeline.
gateplugin-JdbcLookup
A GATE plugin which makes it easy to add or update annotations based on looking up information in a JDBC table.
gateplugin-Format_Misc
A GATE plugin for loading and saving documents in a number of additional formats: GZIP compressed GATE XML, GATE XML Snappy compressed, Java Object serialized, Java Object serialized with Snappy compression, Java Object serialized with GZIP compression.
gateplugin-Scala
A GATE plugin which makes it possible to write Scala code that gets executed in a pipeline from within GATE.
gateplugin-Tagger_CoreNLP
A GATE plugin which can connect to the Stanford CoreNLP server to annotate documents.
gateplugin-Tagger_GoogleNLP
A GATE plugin which can connect to the Google NLP Service to annotate documents.
simple-issues-tracker
A very simple script for tracking issues. This is meant to be used from within a git repository and will simply manage issues by creating a new file for each issue in a subdirectory of the repository.
license-headers
A simple Python script to add or replace license headers in all files in a directory tree of source files.
SesameSPARQL
A tool to query an OpenRDF Sesame SPARQL endpoint and retrieve results as TSV files. It can handle large result sets without timing out or exhausting the memory of the server by retrieving results in small batches and increasing the default query timeout.
gatetools-runpipeline
A useful and flexible command line script to run a GATE pipeline on documents in a directory.
gateplugin-dict-lemmatizer
A GATE plugin which adds lemmata to tokens based on their Universal Dependencies POS tags. This currently works for English, German, French, Italian, Dutch and Spanish, though for some languages only Wiktionary-based lookups are used while others use a morphological transducer. This is based on the code by Ahmet Aker: http://staffwww.dcs.shef.ac.uk/people/A.Aker/activityNLPProjects.html
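
To illustrate the idea behind python-sparsevectors listed above: a sparse vector built on defaultdict(float) stores only the non-zero entries and treats every missing key as 0.0. The following is a minimal sketch of that idea, not the actual API of python-sparsevectors or hvector; the class name SparseVector, the methods dot and add_scaled, and the feature names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class SparseVector(defaultdict):
    """Hypothetical sparse float vector: keys that are not stored read as 0.0."""

    def __init__(self, items=()):
        # float() returns 0.0, so a missing key defaults to 0.0 on item access.
        super().__init__(float, items)

    def dot(self, other):
        # Iterate over the shorter vector and use .get() so that the
        # lookup does not insert new keys into the longer one.
        small, large = (self, other) if len(self) <= len(other) else (other, self)
        return sum(value * large.get(key, 0.0) for key, value in small.items())

    def add_scaled(self, other, scale=1.0):
        # In-place update self += scale * other; only the entries
        # stored in other are touched.
        for key, value in other.items():
            self[key] += scale * value
        return self


# Illustrative usage with made-up feature names.
x = SparseVector({"bias": 1.0, "word=gate": 2.0})
w = SparseVector()
w.add_scaled(x, 0.5)   # w is now {"bias": 0.5, "word=gate": 1.0}
print(w.dot(x))        # 2.5
```

Because only stored entries are visited, the cost of dot and add_scaled depends on the number of non-zero features rather than on the size of the feature space, which is what makes this representation attractive for learning algorithms with very large, mostly empty feature vectors.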