1. Neuralangelo: High-Fidelity Neural Surface Reconstruction
NVlabs/neuralangelo • • CVPR 2023
Neural surface reconstruction has been shown to be powerful for recovering dense 3D surfaces via image-based neural rendering.
Neural Rendering Surface Reconstruction
2. Efficient Guided Generation for Large Language Models
normal-computing/outlines • • 19 Jul 2023
In this article we show how the problem of neural text generation can be constructively reformulated in terms of transitions
between the states of a finite-state machine.
Language Modelling Text Generation
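The finite-state-machine view in entry 2 can be illustrated with a toy decoder. This is a minimal sketch, not the outlines library's API: the vocabulary, the transition table, and the `toy_logits` stand-in model are all hypothetical. At each step, tokens with no outgoing transition from the current state are masked out before the next token is picked.

```python
import math

# Hypothetical toy vocabulary; the FSM below accepts one or more binary digits, then <eos>.
VOCAB = ["0", "1", "a", "b", "<eos>"]

# TRANSITIONS[state][token] -> next state. Tokens missing from a state are disallowed there.
TRANSITIONS = {
    "start": {"0": "digits", "1": "digits"},
    "digits": {"0": "digits", "1": "digits", "<eos>": "done"},
}


def mask_logits(logits, state):
    """Set the logit of every token the FSM forbids in `state` to -inf."""
    allowed = TRANSITIONS.get(state, {})
    return [l if tok in allowed else -math.inf for tok, l in zip(VOCAB, logits)]


def greedy_decode(logit_fn, max_steps=10):
    """Greedy decoding where each step is an FSM transition."""
    state, out = "start", []
    for _ in range(max_steps):
        if state == "done":
            break
        masked = mask_logits(logit_fn(out), state)
        tok = VOCAB[max(range(len(VOCAB)), key=masked.__getitem__)]
        state = TRANSITIONS[state][tok]
        if tok != "<eos>":
            out.append(tok)
    return "".join(out)


def toy_logits(prefix):
    """Stand-in for a language model: it prefers the forbidden token "a"."""
    eos_score = 5.0 if len(prefix) >= 3 else -5.0
    return [0.5, 1.0, 9.0, 0.0, eos_score]


print(greedy_decode(toy_logits))
```

Although the stand-in model scores "a" highest at every step, the mask keeps the output inside the language the FSM accepts.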
3. FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization
apple/ml-fastvit • • 24 Mar 2023
To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural
reparameterization to lower the memory access cost by removing skip-connections in the network.
Image Classification
4. Generative Agents: Interactive Simulacra of Human Behavior
joonspk-research/generative_agents • 7 Apr 2023
Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal
spaces for interpersonal communication to prototyping tools.
Language Modelling
5. OctoPack: Instruction Tuning Code Large Language Models
bigcode-project/octopack • • 14 Aug 2023
We benchmark CommitPack against other natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B parameter
StarCoder model, and achieve state-of-the-art performance among models not trained on OpenAI outputs, on the HumanEval Python
benchmark (46.2% pass@1).
Repair
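For readers unfamiliar with the pass@1 figure quoted in entry 5, the sketch below shows the commonly used unbiased pass@k estimator from the original HumanEval evaluation. Whether OctoPack computes its number in exactly this way is an assumption here; the function name and arguments are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n sampled solutions, c of which pass.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no passing solution.
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimate reduces to the fraction of passing samples, c / n.
print(pass_at_k(10, 3, 1))
```

A benchmark score is then the mean of this per-problem estimate over all problems.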
6. Platypus: Quick, Cheap, and Powerful Refinement of LLMs
arielnlee/Platypus • • 14 Aug 2023
We present a family of fine-tuned and merged Large Language Models (LLMs) that achieves the strongest performance and
currently stands at first place in HuggingFace's Open LLM Leaderboard as of the release date of this work.
7. Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction
adbar/trafilatura • ACL 2021
The tool performs significantly better than other open-source solutions in this evaluation and in external benchmarks.
8. CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
qiuyu96/codef • • 15 Aug 2023
We present the content deformation field CoDeF as a new type of video representation, which consists of a canonical content field
aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the
canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis. Given a target video,
these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We advisedly introduce some
regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from
the video. With such a design, CoDeF naturally supports lifting image algorithms for video processing, in the sense that one can apply an
image algorithm to the canonical image and effortlessly propagate the outcomes to the entire video with the aid of the temporal
deformation field. We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift
keypoint detection to keypoint tracking without any training. More importantly, thanks to our lifting strategy that deploys the
algorithms on only one image, we achieve superior cross-frame consistency in processed videos compared to existing video-to-video
translation approaches, and even manage to track non-rigid objects like water and smog. The project page can be found at
https://qiuyu96.github.io/CoDeF/.
Image-to-Image Translation Keypoint Detection +1
9. Color-NeuS: Reconstructing Neural Implicit Surfaces with Color
Colmar-zlicheng/Color-NeuS • • 14 Aug 2023
Mesh is extracted from the signed distance function (SDF) network for the surface, and color for each surface vertex is drawn from
the global color network.
10. MetaGPT: Meta Programming for Multi-Agent Collaborative Framework
geekan/metagpt • 1 Aug 2023
Recently, remarkable progress has been made in automated task-solving through the use of multiple agents driven by large language
models (LLMs).
11. 3D Gaussian Splatting for Real-Time Radiance Field Rendering
graphdeco-inria/gaussian-splatting • • 8 Aug 2023
Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos.
Camera Calibration Novel View Synthesis
12. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X
thudm/codegeex2 • • 30 Mar 2023
Large pre-trained code generation models, such as OpenAI Codex, can generate syntax- and function-correct code, making the coding of
programmers more productive and our pursuit of artificial general intelligence closer.
Ranked #25 on Code Generation on HumanEval
Code Generation
13. DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models
showlab/datasetdm • • 11 Aug 2023
To showcase the power of the proposed approach, we generate datasets with rich dense pixel-wise labels for a wide range of
downstream tasks, including semantic segmentation, instance segmentation, and depth estimation.
Depth Estimation Domain Generalization +4
14. GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
robustnlp/cipherchat • • 12 Aug 2023
We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural
languages -- ciphers.
Ethics
15. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
openbmb/toolbench • • 31 Jul 2023
We first present ToolBench, an instruction-tuning dataset for tool use, which is created automatically using ChatGPT.
16. WizardCoder: Empowering Code Large Language Models with Evol-Instruct
nlpxucan/wizardlm • • 14 Jun 2023
Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and
HumanEval+.
Ranked #7 on Code Generation on HumanEval
Code Generation
17. Direct Preference Optimization: Your Language Model is Secretly a Reward Model
hiyouga/llama-efficient-tuning • • 29 May 2023
However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and
then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far
from the original model.
Language Modelling reinforcement-learning
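The RLHF fine-tuning stage described in entry 17 is usually written as a KL-regularized reward maximization. The notation below is an assumption, not taken from this listing: $\pi_\theta$ is the policy being tuned, $\pi_{\mathrm{ref}}$ the original model, $r_\phi$ the fitted reward model, $\mathcal{D}$ the prompt distribution, and $\beta$ a coefficient controlling how far the policy may drift.

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\bigl[ r_\phi(x, y) \bigr]
\;-\; \beta \,
\mathbb{D}_{\mathrm{KL}}\!\bigl[ \pi_\theta(y \mid x) \,\Vert\, \pi_{\mathrm{ref}}(y \mid x) \bigr]
```

The first term is the estimated reward to be maximized; the second penalizes drifting too far from the original model.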
18. BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects
NVlabs/BundleSDF • • CVPR 2023
We present a near real-time method for 6-DoF tracking of an unknown object from a monocular RGBD video sequence, while
simultaneously performing neural 3D reconstruction of the object.
3D Object Tracking 3D Reconstruction +4
19. Follow Anything: Open-set detection, tracking, and following in real-time
alaamaalouf/followanything • • 10 Aug 2023
We demonstrate FAn on a real-world robotic system (a micro aerial vehicle) and report its ability to seamlessly follow the objects of
interest in a real-time control loop.
20. Separate Anything You Describe
audio-agi/audiosep • 9 Aug 2023
In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries.
Audio Source Separation Natural Language Queries +1
21. Inst-Inpaint: Instructing to Remove Objects with Diffusion Models
abyildirim/inst-inpaint • • 6 Apr 2023
From the application point of view, a user needs to generate the masks for the objects they would like to remove which can be
time-consuming and prone to errors.
Image Inpainting
22. Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions
dcdmllm/cheetah • • 8 Aug 2023
To address this issue, we propose a generic and lightweight controllable knowledge re-injection module, which utilizes the
sophisticated reasoning ability of LLMs to control the VPG to conditionally extract instruction-specific visual information and re-
inject it into the LLM.
Image Captioning Instruction Following
23. 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment
3d-vista/3D-VisTA • • 8 Aug 2023
3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which
is crucial for achieving embodied intelligence.
Dense Captioning Question Answering +3
24. SegPrompt: Boosting Open-world Segmentation via Category-level Prompt Learning
aim-uofa/segprompt • • 12 Aug 2023
In this work, we propose a novel training mechanism termed SegPrompt that uses category information to improve the model's
class-agnostic segmentation ability for both known and unknown categories.
Instance Segmentation Semantic Segmentation
25. MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
facebookresearch/muavic • 1 Mar 2023
We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation
providing 1200 hours of audio-visual speech in 9 languages.
Audio-Visual Speech Recognition Robust Speech Recognition +4
26. LISA: Reasoning Segmentation via Large Language Model
dvlab-research/lisa • • 1 Aug 2023
In this work, we propose a new segmentation task -- reasoning segmentation.
Language Modelling Text Generation
27. Shepherd: A Critic for Language Model Generation
facebookresearch/shepherd • 8 Aug 2023
As large language models improve, there is increasing interest in techniques that leverage these models' capabilities to refine their
own outputs.
Language Modelling
28. #InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models
ofa-sys/instag • 14 Aug 2023
Based on this observation, we propose a data selector based on InsTag to select 6K diverse and complex samples from open-source
datasets and fine-tune models on InsTag-selected data.
Instruction Following TAG
29. Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-
turn Dialogue
suprityoung/zhongjing • 7 Aug 2023
Recent advances in Large Language Models (LLMs) have achieved remarkable breakthroughs in understanding and responding to
user intents.
Instruction Following Language Modelling
30. PolyLM: An Open Source Polyglot Large Language Model
modelscope/modelscope • • 12 Jul 2023
Large language models (LLMs) demonstrate remarkable ability to comprehend, reason, and generate following natural language
instructions.
Language Modelling Question Answering
31. LLM As DBA
tsinghuadatabasegroup/db-gpt • 10 Aug 2023
Database administrators (DBAs) play a crucial role in managing, maintaining and optimizing a database system to ensure data
availability, performance, and reliability.
32. AnyLoc: Towards Universal Visual Place Recognition
AnyLoc/AnyLoc • • 1 Aug 2023
In this work, we develop a universal solution to VPR -- a technique that works across a broad range of structured and unstructured
environments (urban, outdoors, indoors, aerial, underwater, and subterranean environments) without any re-training or fine-
tuning.
Ranked #1 on Visual Place Recognition on Nardo-Air R
Image Retrieval Visual Place Recognition
33. Llama 2: Open Foundation and Fine-Tuned Chat Models
Lightning-AI/lit-gpt • • 18 Jul 2023
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in
scale from 7 billion to 70 billion parameters.
Ranked #2 on Question Answering on TriviaQA
Arithmetic Reasoning Generation +4
34. PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning
facebookresearch/pug • • 8 Aug 2023
Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to
(i) render as many data samples as needed, (ii) precisely control each scene and yield granular ground truth labels (and captions),
(iii) precisely control distribution shifts between training and testing to isolate variables of interest for sound experimentation.
Representation Learning
35. SMILE: Single-turn to Multi-turn Inclusive Language Expansion via ChatGPT for Mental Health Support
qiuhuachuan/smile • • 30 Apr 2023
There has been an increasing research interest in developing specialized dialogue systems that can offer mental health support.
36. Effective Whole-body Pose Estimation with Two-stages Distillation
idea-research/dwpose • • 29 Jul 2023
Different from the previous self-knowledge distillation, this stage finetunes the student's head with only 20% training time as a
plug-and-play training strategy.
Ranked #1 on 2D Human Pose Estimation on COCO-WholeBody (using extra training data)
2D Human Pose Estimation Pose Estimation +1
37. Global Features are All You Need for Image Retrieval and Reranking
shihaoshao-gh/superglobal • 14 Aug 2023
We, for the first time, propose an image retrieval paradigm leveraging global feature only to enable accurate and lightweight image
retrieval for both coarse retrieval and reranking, thus the name - SuperGlobal.
Image Retrieval Retrieval
38. Large Language Models for Information Retrieval: A Survey
ruc-nlpir/llm4ir-survey • 14 Aug 2023
This evolution requires a combination of both traditional methods (such as term-based sparse retrieval methods with rapid
response) and modern neural architectures (such as language models with powerful language understanding capacity).
Information Retrieval Question Answering +2
39. UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation
haiyang-w/unitr • 15 Aug 2023
Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous
driving systems.
3D Object Detection Autonomous Driving +2
40. EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models
zjunlp/easyedit • • 14 Aug 2023
Large Language Models (LLMs) usually suffer from knowledge cutoff or fallacy issues, which means they are unaware of unseen
events or generate text with incorrect facts owing to the outdated/noisy data.
41. LightGlue: Local Feature Matching at Light Speed
cvg/lightglue • • 23 Jun 2023
We introduce LightGlue, a deep neural network that learns to match local features across images.
3D Reconstruction Homography Estimation +3
42. Universal and Transferable Adversarial Attacks on Aligned Language Models
llm-attacks/llm-attacks • • 27 Jul 2023
Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable
content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer).
Adversarial Attack
43. SHERF: Generalizable Human NeRF from a Single Image
skhu101/sherf • • 22 Mar 2023
To this end, we propose a bank of 3D-aware hierarchical features, including global, point-level, and pixel-aligned features, to
facilitate informative encoding.
3D Human Reconstruction
44. Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow
bcmi/DCI-VTON-Virtual-Try-On • 11 Aug 2023
Our approach, namely Diffusion-based Conditional Inpainting for Virtual Try-ON (DCI-VTON), effectively utilizes the power of the
diffusion model, and the incorporation of the warping module helps to produce high-quality and realistic virtual try-on results.
Denoising Image Generation +1
45. UniWorld: Autonomous Driving Pre-training via World Models
chaytonmin/uniworld • 14 Aug 2023
In this paper, we draw inspiration from Alberto Elfes' pioneering work in 1989, where he introduced the concept of the occupancy grid as
World Models for robots.
3D Object Detection Autonomous Driving +2
46. Fine-Tuning Language Models from Human Preferences
lvwerra/trl • • 18 Sep 2019
Most work on reward learning has used simulated environments, but complex information about values is often expressed in
natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks.
Descriptive Language Modelling +1
47. MARS: An Instance-aware, Modular and Realistic Simulator for Autonomous Driving
open-air-sun/mars • • 27 Jul 2023
We expect this modular design to boost academic progress and industrial deployment of NeRF-based autonomous driving
simulation.
Autonomous Driving
48. Phoneme Hallucinator: One-shot Voice Conversion via Set Expansion
PhonemeHallucinator/Phoneme_Hallucinator • • 11 Aug 2023
Objective and subjective evaluations show that Phoneme Hallucinator outperforms existing VC methods for both
intelligibility and speaker similarity.
Voice Conversion
49. All in One: Multi-task Prompting for Graph Neural Networks
sheldonresearch/ProG • • 4 Jul 2023
Inspired by the prompt learning in natural language processing (NLP), which has presented significant effectiveness in leveraging
prior knowledge for various NLP tasks, we study the prompting topic for graphs with the motivation of filling the gap between pre-
trained models and various graph tasks.
Meta-Learning
50. NNVISR: Bring Neural Network Video Interpolation and Super Resolution into Video Processing Framework
tongyuantongyu/vs-nnvisr • 6 Aug 2023
We present NNVISR - an open-source filter plugin for the VapourSynth video processing framework, which facilitates the
application of neural networks for various kinds of video enhancing tasks, including denoising, super resolution, interpolation, and
spatio-temporal super-resolution.
Denoising Super-Resolution +1
51. Revisiting the Minimalist Approach to Offline Reinforcement Learning
corl-team/CORL • • 16 May 2023
In this work, we aim to bridge this gap by conducting a retrospective analysis of recent works in offline RL and propose ReBRAC, a
minimalistic algorithm that integrates such design elements built on top of the TD3+BC method.
D4RL Offline RL +2
52. Volume Rendering of Neural Implicit Surfaces
mli0603/blenderneuralangelo • NeurIPS 2021
Accurate sampling is important to provide a precise coupling of geometry and radiance; and (iii) it allows efficient unsupervised
disentanglement of shape and appearance in volume rendering.
Disentanglement Inductive Bias
53. Gorilla: Large Language Model Connected with Massive APIs
ShishirPatil/gorilla • • 24 May 2023
Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks,
such as mathematical reasoning and program synthesis.
Language Modelling Mathematical Reasoning +2
54. LLaMA: Open and Efficient Foundation Language Models
facebookresearch/llama • • arXiv 2023
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.
Ranked #1 on Question Answering on SIQA
Common Sense Reasoning Math Word Problem Solving +3
55. OxfordVGG Submission to the EGO4D AV Transcription Challenge
m-bain/whisperx • • 18 Jul 2023
This report presents the technical details of our submission on the EGO4D Audio-Visual (AV) Automatic Speech Recognition
Challenge 2023 from the OxfordVGG team.
Automatic Speech Recognition speech-recognition +1
56. SimplyRetrieve: A Private and Lightweight Retrieval-Centric Generative AI Tool
rcgai/simplyretrieve • • 8 Aug 2023
Large Language Model (LLM) based Generative AI systems have seen significant progress in recent years.
Language Modelling Memorization +1
57. Simple and Controllable Music Generation
facebookresearch/audiocraft • • 8 Jun 2023
We tackle the task of conditional music generation.
Ranked #4 on Text-to-Music Generation on MusicCaps
Music Generation Text-to-Music Generation
58. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
facebookresearch/llama • • 22 May 2023
Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference.
Language Modelling
59. Vision-Language Models for Vision Tasks: A Survey
jingyi0000/vlm_survey • • 3 Apr 2023
Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks (DNNs) training, and they usually train
a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm.
Benchmarking Knowledge Distillation +1
60. Simple synthetic data reduces sycophancy in large language models
google/sycophancy-intervention • 7 Aug 2023
Adding these data in a lightweight finetuning step can significantly reduce sycophantic behavior on held-out prompts.
61. Lighting Every Darkness in Two Pairs: A Calibration-Free Pipeline for RAW Denoising
srameo/led • • 7 Aug 2023
Calibration-based methods have dominated RAW image denoising under extremely low-light environments.
Ranked #1 on Image Denoising on SID SonyA7S2 x300
Image Denoising
62. Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies
teacherpeterpan/self-correction-llm-s • 6 Aug 2023
Large language models (LLMs) have demonstrated remarkable performance across a wide array of NLP tasks.
63. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
PanQiWei/AutoGPTQ • • 31 Oct 2022
In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate
second-order information, that is both highly accurate and highly efficient.
Language Modelling Model Compression +1
64. FocalFormer3D: Focusing on Hard Instance for 3D Object Detection
NVlabs/FocalFormer3D • • 8 Aug 2023
For 3D object detection, we instantiate this method as FocalFormer3D, a simple yet effective detector that excels at excavating
difficult objects and improving prediction recall.
Ranked #8 on 3D Object Detection on nuScenes
3D Object Detection Autonomous Driving +1
65. The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
opengvlab/all-seeing • 3 Aug 2023
We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open
world.
Question Answering Retrieval +1
66. LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models
TonyLianLong/LLM-groundedDiffusion • • 23 May 2023
We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately
generating images according to prompts that necessitate both language and spatial reasoning.
Common Sense Reasoning
67. Meta-Transformer: A Unified Framework for Multimodal Learning
invictus717/MetaTransformer • • 20 Jul 2023
Multimodal learning aims to build models that can process and relate information from multiple modalities.
Time Series
68. Autoregressive Visual Tracking
miv-xjtu/artrack • • CVPR 2023
We present ARTrack, an autoregressive framework for visual object tracking.
Ranked #1 on Visual Object Tracking on TNL2K
Template Matching Visual Object Tracking +1
69. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation
threestudio-project/threestudio • • 25 May 2023
In this work, we propose to model the 3D parameter as a random variable instead of a constant as in SDS and present variational
score distillation (VSD), a principled particle-based variational framework to explain and address the aforementioned issues in text-
to-3D generation.
Text to 3D
70. BatGPT: A Bidirectional Autoregessive Talker from Generative Pre-trained Transformer
haonan-li/cmmlu • • 1 Jul 2023
BatGPT is a large-scale language model designed and trained jointly by Wuhan University and Shanghai Jiao Tong University.
Language Modelling Question Answering +1
71. QAmeleon: Multilingual QA with Only 5 Examples
google-research-datasets/qameleon • 15 Nov 2022
The availability of large, high-quality datasets has been one of the main drivers of recent progress in question answering (QA).
Few-Shot Learning Question Answering
72. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
baichuan-inc/baichuan-13b • • ICLR 2022
Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how
does a model achieve extrapolation at inference time for sequences that are longer than it saw during training?
Inductive Bias Playing the Game of 2048 +1
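The "attention with linear biases" named in the title of entry 72 can be sketched as follows. This is an illustrative pure-Python version, assuming a power-of-two number of heads and a causal mask; the function names are hypothetical. Each head adds a penalty to its attention scores that grows linearly with the distance between query and key, scaled by a head-specific slope.

```python
def alibi_slopes(num_heads):
    """Head-specific slopes: a geometric sequence starting at 2^(-8 / num_heads).

    Assumes num_heads is a power of two.
    """
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias added to attention scores before softmax, for a single head.

    Entry [i][j] is -slope * (i - j) for keys at or before the query (j <= i),
    and -inf for future positions, which the causal mask blocks.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
for row in alibi_bias(4, slopes[0]):
    print(row)
```

Because the bias depends only on relative distance, the same function yields a bias matrix for any `seq_len`, including lengths longer than those seen in training.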
73. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
guoyww/animatediff • • 10 Jul 2023
With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as
DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost.
Image Animation
74. Language Models are Few-Shot Learners
ggerganov/llama.cpp • • NeurIPS 2020
By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something
which current NLP systems still largely struggle to do.
Ranked #1 on Zero-Shot Learning on HellaSwag
Common Sense Reasoning Coreference Resolution +11
75. Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback
nlp-uoregon/okapi • • 29 Jul 2023
Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of
future multilingual LLM research.
76. A Survey on Multimodal Large Language Models
bradyfu/awesome-multimodal-large-language-models • 23 Jun 2023
Multimodal Large Language Model (MLLM) recently has been a new rising research hotspot, which uses powerful Large Language
Models (LLMs) as a brain to perform multimodal tasks.
Language Modelling Optical Character Recognition (OCR) +1
77. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
stability-ai/generative-models • • 4 Jul 2023
We present SDXL, a latent diffusion model for text-to-image synthesis.
Image Generation
78. Large Multimodal Models: Notes on CVPR 2023 Tutorial
haotian-liu/LLaVA • • 26 Jun 2023
This tutorial note summarizes the presentation on "Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4",
a part of CVPR 2023 tutorial on "Recent Advances in Vision Foundation Models".
Language Modelling
79. Semantic-SAM: Segment and Recognize Anything at Any Granularity
ux-decoder/semantic-sam • • 10 Jul 2023
In this paper, we introduce Semantic-SAM, a universal image segmentation model to enable segment and recognize anything at any
desired granularity.
Image Segmentation Semantic Segmentation
80. MMBench: Is Your Multi-modal Model an All-around Player?
InternLM/opencompass • • 12 Jul 2023
In response to these challenges, we propose MMBench, a novel multi-modality benchmark.
81. Dual Aggregation Transformer for Image Super-Resolution
zhengchen1999/dat • • 7 Aug 2023
Based on the above idea, we propose a novel Transformer model, Dual Aggregation Transformer (DAT), for image SR. Our DAT
aggregates features across spatial and channel dimensions, in the inter-block and intra-block dual manner.
Image Super-Resolution
82. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
thudm/chatglm2-6b • • 28 Jan 2022
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of
large language models to perform complex reasoning.
GSM8K Language Modelling
83. FoodSAM: Any Food Segmentation
jamesjg/foodsam • • 11 Aug 2023
Remarkably, this pioneering framework stands as the first-ever work to achieve instance, panoptic, and promptable segmentation
on food images.
Ranked #1 on Semantic Segmentation on FoodSeg103 (using extra training data)
Image Segmentation Instance Segmentation +1
84. Segment Anything in High Quality
syscv/sam-hq • • 2 Jun 2023
HQ-SAM is only trained on the introduced dataset of 44k masks, which takes only 4 hours on 8 GPUs.
Ranked #1 on Zero Shot Segmentation on Segmentation in the Wild
2D Semantic Segmentation Semantic Segmentation +1
85. MapTRv2: An End-to-End Framework for Online Vectorized HD Map Construction
hustvl/maptr • • 10 Aug 2023
We propose a unified permutation-equivalent modeling approach, i.e., modeling map element as a point set with a group of
equivalent permutations, which accurately describes the shape of map element and stabilizes the learning process.
Autonomous Driving
86. Memory-and-Anticipation Transformer for Online Action Understanding
echo0125/memory-and-anticipation-transformer • • 15 Aug 2023
Based on this idea, we present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address
the online action detection and anticipation tasks.
Action Understanding Online Action Detection
87. ObjectSDF++: Improved Object-Compositional Neural Implicit Surfaces
qianyiwu/objectsdf_plus • • 15 Aug 2023
Unlike traditional multi-view stereo approaches, the neural implicit surface-based methods leverage neural networks to represent
3D scenes as signed distance functions (SDFs).
3D Reconstruction Object Reconstruction +1
88. AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes
LZH-0225/AudioFormer • 14 Aug 2023
In our experiments, we treat discrete acoustic codes as textual data and train a masked language model using a cloze-like methodology,
ultimately deriving high-quality audio representations.
Ranked #1 on Audio Classification on Balanced Audio Set
Audio Classification Classification +2
89. h2oGPT: Democratizing Large Language Models
h2oai/h2ogpt • • 13 Jun 2023
Applications built on top of Large Language Models (LLMs) such as GPT-4 represent a revolution in AI due to their human-level
capabilities in natural language processing.
Chatbot Fairness +8
QuIP: 2-Bit Quantization of Large Language Models With Guarantees
jerry-chee/quip • • 25 Jul 2023
This work studies post-training parameter quantization in large language models (LLMs).
Quantization
90. 3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking
dsx0511/3dmotformer • • 12 Aug 2023
Tracking 3D objects accurately and consistently is crucial for autonomous vehicles, enabling more reliable downstream tasks such
as trajectory prediction and motion planning.
3D Multi-Object Tracking Autonomous Vehicles +5
91. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
dao-ailab/flash-attention • • 17 Jul 2023
We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU,
causing either low-occupancy or unnecessary shared memory reads/writes.
Language Modelling Video Generation
92. "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
verazuo/jailbreak_llms • 7 Aug 2023
The misuse of large language models (LLMs) has garnered significant attention from the general public and LLM vendors.
Community Detection
93. MS3D++: Ensemble of Experts for Multi-Source Unsupervised Domain Adaption in 3D Object Detection
darrenjkt/ms3d • • 11 Aug 2023
MS3D++ provides a straightforward approach to domain adaptation by generating high-quality pseudo-labels, enabling the
adaptation of 3D detectors to a diverse range of lidar types, regardless of their density.
3D Object Detection Domain Generalization +3
94. LCE: An Augmented Combination of Bagging and Boosting in Python
localcascadeensemble/lce • 14 Aug 2023
lcensemble is a high-performing, scalable and user-friendly Python package for the general tasks of classification and regression.
Model Selection regression
95. VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild
vinthony/video-retalking • • 27 Nov 2022
Our system disentangles this objective into three sequential tasks: (1) face video generation with a canonical expression; (2) audio-
driven lip-sync; and (3) face enhancement for improving photo-realism.
Video Editing Video Generation
96. GRES: Generalized Referring Expression Segmentation
henghuiding/ReLA • • CVPR 2023
Existing classic RES datasets and methods commonly support single-target expressions only, i.e., one expression refers to one
target object.
Ranked #1 on Generalized Referring Expression Segmentation on gRefCOCO
Generalized Referring Expression Segmentation Referring Expression
97. Enhancing Efficient Continual Learning with Dynamic Structure Development of Spiking Neural Networks
braincog-x/brain-cog • • 9 Aug 2023
In addition, the overlapping shared structure helps to quickly leverage all acquired knowledge to new tasks, empowering a single
network capable of supporting multiple incremental tasks (without the separate sub-network mask for each task).
class-incremental learning Class Incremental Learning +1
98. EduChat: A Large-Scale Language Model-based Chatbot System for Intelligent Education
icalk-nlp/educhat • • 5 Aug 2023
Currently, EduChat is available online as an open-source project, with its code, data, and model parameters available on platforms (e.g.,
GitHub https://github.com/icalk-nlp/EduChat, Hugging Face https://huggingface.co/ecnu-icalk).
Chatbot Language Modelling +1
99. FusionAD: Multi-modality Fusion for Prediction and Planning Tasks of Autonomous Driving
westlake-autolab/fusionad • 2 Aug 2023
Building a multi-modality multi-task neural network toward accurate and robust performance is a de-facto standard in perception
task of autonomous driving.
Autonomous Driving
100. Making Language Models Better Tool Learners with Execution Feedback
zjunlp/cama • • 22 May 2023
Tools serve as pivotal interfaces that enable humans to understand and reshape the world.
Language Modelling Prompt Engineering
101. Re-Translation Strategies For Long Form, Simultaneous, Spoken Language Translation
opennmt/ctranslate2 • • 6 Dec 2019
As this scenario allows for revisions to our incremental translations, we adopt a re-translation approach to simultaneous
translation, where the source is repeatedly translated from scratch as it grows.
Machine Translation speech-recognition +2
102. VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design
p0p4k/vits2_pytorch • • 31 Jul 2023
Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline
systems.
103. Fine-Tuning Language Models with Just Forward Passes
princeton-nlp/mezo • • 27 May 2023
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation
requires a prohibitively large amount of memory.
Multiple-choice
104. PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition
keeganhk/pointmcd • • 7 Jul 2022
In this paper, we explore the possibility of boosting deep 3D point cloud encoders by transferring visual knowledge extracted from deep 2D
image encoders under a standard teacher-student distillation workflow.
3D Shape Classification 3D Shape Recognition +1
105. ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
chanchimin/chateval • 14 Aug 2023
Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost.
Text Generation
106. SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore
kernelmachine/silo-lm • • 8 Aug 2023
SILO is built by (1) training a parametric LM on Open License Corpus (OLC), a new corpus we curate with 228B tokens of public
domain and permissively licensed text and (2) augmenting it with a more general and easily modifiable nonparametric datastore (e.g.,
containing copyrighted books or news) that is only queried during inference.
Language Modelling
107. Generative Prompt Model for Weakly Supervised Object Localization
callsys/genpromp • • 19 Jul 2023
During training, GenPromp converts image category labels to learnable prompt embeddings which are fed to a generative model to
conditionally recover the input image with noise and learn representative embeddings.
Ranked #1 on Weakly-Supervised Object Localization on CUB-200-2011 (Top-1 Localization Accuracy metric, using extra training
data)
Image Denoising Language Modelling +1
108. SynJax: Structured Probability Distributions for JAX
deepmind/synjax • • 7 Aug 2023
The models that explicitly account for structured objects, such as trees and segmentations, did not benefit equally because they
require custom algorithms that are difficult to implement in a vectorized form.
109. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
timdettmers/bitsandbytes • • 15 Aug 2022
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cut
the memory needed for inference by half while retaining full precision performance.
Ranked #2 on Language Modelling on C4
Language Modelling Linguistic Acceptability +4
110. SSLRec: A Self-Supervised Learning Library for Recommendation
hkuds/sslrec • • 10 Aug 2023
Our SSLRec platform covers a comprehensive set of state-of-the-art SSL-enhanced recommendation models across different
scenarios, enabling researchers to evaluate these cutting-edge models and drive further innovation in the field.
Collaborative Filtering Data Augmentation +2
111. Maximum Entropy Heterogeneous-Agent Mirror Learning
pku-marl/harl • • 19 Jun 2023
Multi-agent reinforcement learning (MARL) has been shown effective for cooperative games in recent years.
Multi-agent Reinforcement Learning
112. Token Merging for Fast Stable Diffusion
picsart-ai-research/text2video-zero • • 30 Mar 2023
In the process, we speed up image generation by up to 2x and reduce memory consumption by up to 5.6x.
Image Generation
113. GPT-NeoX-20B: An Open-Source Autoregressive Language Model
labmlai/annotated_deep_learning_paper_implementations • • BigScience (ACL) 2022
We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile, whose weights will be
made freely and openly available to the public through a permissive license.
Ranked #52 on Multi-task Language Understanding on MMLU
Language Modelling
114. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
idea-research/groundingdino • • 9 Mar 2023
To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a
tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for
cross-modality fusion.
Ranked #1 on Zero-Shot Object Detection on ODinW
object-detection Referring Expression +3
115. In-Context Alignment: Chat with Vanilla Language Models Before Fine-Tuning
xhan77/in-context-alignment • • 8 Aug 2023
In this note, we explore inference-time alignment through in-context learning.
Language Modelling
116. LATR: 3D Lane Detection from Monocular Images with Transformer
jmoonr/latr • • 8 Aug 2023
On the one hand, each query is generated based on 2D lane-aware features and adopts a hybrid embedding to enhance the lane
information.
3D Lane Detection Autonomous Driving
117. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
sjtu-lit/ceval • • 15 May 2023
We present C-Eval, the first comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning
abilities of foundation models in a Chinese context.
Multiple-choice
118. Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
bytedance/fc-clip • • 4 Aug 2023
The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary
classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to a larger input resolution
than the one used during contrastive image-text pretraining.
Ranked #1 on Open Vocabulary Semantic Segmentation on PascalVOC-20b
Open Vocabulary Panoptic Segmentation Open Vocabulary Semantic Segmentation +1
119. The Five-Dollar Model: Generating Game Maps and Sprites from Sentence Embeddings
TimMerino1710/five-dollar-model • • 8 Aug 2023
The five-dollar model is a lightweight text-to-image generative architecture that generates low dimensional images from an encoded
text prompt.
Sentence Embeddings
120. CAMEL: Communicative Agents for "Mind" Exploration of Large Scale Language Model Society
lightaime/camel • 31 Mar 2023
To address the challenges of achieving autonomous cooperation, we propose a novel communicative agent framework named role-
playing.
Language Modelling
121. SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
winfredy/sadtalker • • CVPR 2023
We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly
modulates a novel 3D-aware face render for talking head generation.
Talking Head Generation
122. Aggregated Contextual Transformations for High-Resolution Image Inpainting
zyddnys/manga-image-translator • • 3 Apr 2021
For improving texture synthesis, we enhance the discriminator of AOT-GAN by training it with a tailored mask-prediction task.
Ranked #6 on Image Inpainting on Places2
Image Inpainting Texture Synthesis +1
123. Recognize Anything: A Strong Image Tagging Model
xinyu1205/Recognize_Anything-Tag2Text • • 6 Jun 2023
We are releasing the RAM at https://recognize-anything.github.io/ to foster the advancements of large models in computer
vision.
Semantic Parsing
124. Tree of Thoughts: Deliberate Problem Solving with Large Language Models
ysymyth/tree-of-thought-llm • 17 May 2023
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to
token-level, left-to-right decision-making processes during inference.
Decision Making Language Modelling
125. Memory Transformer
lucidrains/x-transformers • • 20 Jun 2020
Adding trainable memory to selectively store local as well as global representations of a sequence is a promising direction to
improve the Transformer model.
Language Modelling Machine Translation +4
126. One Embedder, Any Task: Instruction-Finetuned Text Embeddings
shibing624/text2vec • • 19 Dec 2022
Our analysis suggests that INSTRUCTOR is robust to changes in instructions, and that instruction finetuning mitigates the challenge
of training a single model on diverse datasets.
Information Retrieval Learning Word Embeddings +3
127. Factuality Enhanced Language Models for Open-Ended Text Generation
NVIDIA/FasterTransformer • • 9 Jun 2022
In this work, we measure and improve the factual accuracy of large-scale LMs for open-ended text generation.
Misconceptions Sentence Completion +1
128. PoissonNet: Resolution-Agnostic 3D Shape Reconstruction using Fourier Neural Operators
arsenal9971/poissonnet • • 3 Aug 2023
Furthermore, we demonstrate that the Poisson surface reconstruction problem is well-posed in the limit case by showing a
universal approximation theorem for the solution operator of the Poisson equation with distributional data utilizing the Fourier
Neural Operator, which provides a theoretical foundation for our numerical results.
3D Shape Reconstruction Super-Resolution +1
129. Learning Landmarks Motion from Speech for Speaker-Agnostic 3D Talking Heads Generation
fedenoce/s2l-s2d • • 2 Jun 2023
This paper presents a novel approach for generating 3D talking heads from raw audio inputs.
3D Face Animation Talking Head Generation
130. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
mit-han-lab/llm-awq • • 1 Jun 2023
Large language models (LLMs) have shown excellent performance on various tasks, but the astronomical model size raises the
hardware barrier for serving (memory size) and slows down token generation (memory bandwidth).
Common Sense Reasoning Language Modelling +1
131. ChatLaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases
pku-yuangroup/chatlaw • • 28 Jun 2023
Furthermore, we propose a self-attention method to enhance the ability of large models to overcome errors present in reference
data, further optimizing the issue of model hallucinations at the model level and improving the problem-solving capabilities of large
models.
Language Modelling Retrieval
132. BAA-NGP: Bundle-Adjusting Accelerated Neural Graphics Primitives
IntelLabs/baa-ngp • • 7 Jun 2023
Implicit neural representation has emerged as a powerful method for reconstructing 3D scenes from 2D images.
3D Scene Reconstruction Novel View Synthesis +2
133. Gentopia: A Collaborative Platform for Tool-Augmented LLMs
gentopia-ai/gentopia • 8 Aug 2023
We present gentopia, an ALM framework enabling flexible customization of agents through simple configurations, seamlessly
integrating various language models, task formats, prompting modules, and plugins into a unified paradigm.
134. Voyager: An Open-Ended Embodied Agent with Large Language Models
MineDojo/Voyager • 25 May 2023
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world,
acquires diverse skills, and makes novel discoveries without human intervention.
135. Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors
guochengqian/magic123 • • 30 Jun 2023
We present Magic123, a two-stage coarse-to-fine approach for high-quality, textured 3D meshes generation from a single unposed
image in the wild using both 2D and 3D priors.
Image to 3D
136. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
suno-ai/bark • • 5 Jan 2023
In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis.
Language Modelling Speech Synthesis +1
137. One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization
One-2-3-45/One-2-3-45 • 29 Jun 2023
Single image 3D reconstruction is an important but challenging task that requires extensive knowledge of our natural world.
3D Reconstruction Image to 3D +2
138. Exploring Predicate Visual Context in Detecting of Human-Object Interactions
fredzzhang/pvic • • 11 Aug 2023
Recently, the DETR framework has emerged as the dominant approach for human-object interaction (HOI) research.
Ranked #1 on Human-Object Interaction Detection on HICO-DET
Human-Object Interaction Detection
139. Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
charactr-platform/vocos • • 1 Jun 2023
Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the
time-domain.
Inductive Bias
140. UniVTG: Towards Unified Video-Language Temporal Grounding
showlab/univtg • • 31 Jul 2023
Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval
(time interval) and highlight detection (worthiness curve), which limits their abilities to generalize to various VTG tasks and labels.
Ranked #1 on Highlight Detection on QVHighlights (using extra training data)
Highlight Detection Moment Retrieval +2
141. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
mlfoundations/open_flamingo • • 2 Aug 2023
We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters.
142. AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
automatic1111/stable-diffusion-webui • • 12 Nov 2022
In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal
representation model.
Ranked #1 on Zero-Shot Transfer Image Classification on CN-ImageNet-Sketch
Contrastive Learning Cross-Modal Retrieval +10
143. Turning Whisper into Real-Time Transcription System
ufal/whisper_streaming • • 27 Jul 2023
Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models; however, it is not designed for
real-time transcription.
speech-recognition Speech Recognition +1
144. DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
Yujun-Shi/DragDiffusion • • 26 Jun 2023
In this work, we extend such an editing framework to diffusion models and propose DragDiffusion.
145. SkiROS2: A skill-based Robot Control Platform for ROS
rvmi/skiros2 • • 29 Jun 2023
The need for autonomous robot systems in both the service and the industrial domain is larger than ever.
Scheduling
146. PlankAssembly: Robust 3D Reconstruction from Three Orthographic Views with Learnt Shape Programs
manycore-research/PlankAssembly • • 10 Aug 2023
In this paper, we develop a new method to automatically convert 2D line drawings from three orthographic views into 3D CAD models.
3D Reconstruction
147. A Survey on Evaluation of Large Language Models
mlgroupjlu/llm-eval-survey • 6 Jul 2023
Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented
performance in various applications.
Ethics
148. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
lm-sys/fastchat • • 9 Jun 2023
Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of
existing benchmarks in measuring human preferences.
Chatbot Language Modelling
149. Multi-scale Multi-band DenseNets for Audio Source Separation
Anjok07/ultimatevocalremovergui • • 29 Jun 2017
This paper deals with the problem of audio source separation.
Audio Source Separation Music Source Separation
150. Retentive Network: A Successor to Transformer for Large Language Models
microsoft/torchscale • • 17 Jul 2023
In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously
achieving training parallelism, low-cost inference, and good performance.
Language Modelling
151. AgentBench: Evaluating LLMs as Agents
thudm/agentbench • 7 Aug 2023
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond
traditional NLP tasks.
Decision Making